Context The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Objective:

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas

As a Data scientist at Thera bank, we need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards

We need to identify the best possible model that will give the required performance

Data Dictionary

CLIENTNUM: Client number. Unique identifier for the customer holding the account

Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"

Customer_Age: Age in Years

Gender: Gender of the account holder

Dependent_count: Number of dependents

Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.

Marital_Status: Marital Status of the account holder

Income_Category: Annual Income Category of the account holder

Card_Category: Type of Card

Months_on_book: Period of relationship with the bank Total_Relationship_Count: Total no. of products held by the customer

Months_Inactive_12_mon: No. of months inactive in the last 12 months

Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months

Credit_Limit: Credit Limit on the Credit Card

Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance

Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter

Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

In [249]:
import pandas as pd
import numpy as np
In [250]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
In [251]:
from sklearn.impute import SimpleImputer
In [252]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
In [ ]:
from xgboost import XGBClassifier

!pip install lightgbm
import lightgbm as lgb
Requirement already satisfied: lightgbm in /usr/local/lib/python3.11/dist-packages (4.5.0)
Requirement already satisfied: numpy>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from lightgbm) (1.26.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from lightgbm) (1.13.1)
/usr/local/lib/python3.11/dist-packages/dask/dataframe/__init__.py:42: FutureWarning: 
Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

  warnings.warn(msg, FutureWarning)
In [ ]:
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
# Import plot_confusion_matrix and plot_roc_curve from their updated locations
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# You can then use them like this:
# ConfusionMatrixDisplay.from_estimator(estimator, X_test, y_test)
# or
# ConfusionMatrixDisplay.from_predictions(y_true, y_pred)

# Similarly for plot_roc_curve:
# RocCurveDisplay.from_estimator(estimator, X_test, y_test)
# or
# RocCurveDisplay.from_predictions(y_true, y_pred)
In [ ]:
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    OneHotEncoder,
    RobustScaler,
)
In [ ]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
In [ ]:
!pip install ydata-profiling # Install the package providing pandas profiling functionality
from ydata_profiling import ProfileReport # Update the import statement to reflect the correct package and module name
Collecting ydata-profiling
  Downloading ydata_profiling-4.12.2-py2.py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: scipy<1.16,>=1.4.1 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (1.13.1)
Requirement already satisfied: pandas!=1.4.0,<3,>1.1 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (2.2.2)
Requirement already satisfied: matplotlib>=3.5 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (3.10.0)
Requirement already satisfied: pydantic>=2 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (2.10.6)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (6.0.2)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (3.1.5)
Collecting visions<0.8.0,>=0.7.5 (from visions[type_image_path]<0.8.0,>=0.7.5->ydata-profiling)
  Downloading visions-0.7.6-py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: numpy<2.2,>=1.16.0 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (1.26.4)
Collecting htmlmin==0.1.12 (from ydata-profiling)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik<0.13,>=0.11.1 (from ydata-profiling)
  Downloading phik-0.12.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Requirement already satisfied: requests<3,>=2.24.0 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (2.32.3)
Requirement already satisfied: tqdm<5,>=4.48.2 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (4.67.1)
Requirement already satisfied: seaborn<0.14,>=0.10.1 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (0.13.2)
Collecting multimethod<2,>=1.4 (from ydata-profiling)
  Downloading multimethod-1.12-py3-none-any.whl.metadata (9.6 kB)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (0.14.4)
Requirement already satisfied: typeguard<5,>=3 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (4.4.1)
Collecting imagehash==4.3.1 (from ydata-profiling)
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl.metadata (8.0 kB)
Requirement already satisfied: wordcloud>=1.9.3 in /usr/local/lib/python3.11/dist-packages (from ydata-profiling) (1.9.4)
Collecting dacite>=1.8 (from ydata-profiling)
  Downloading dacite-1.9.2-py3-none-any.whl.metadata (17 kB)
Collecting PyWavelets (from imagehash==4.3.1->ydata-profiling)
  Downloading pywavelets-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.0 kB)
Requirement already satisfied: pillow in /usr/local/lib/python3.11/dist-packages (from imagehash==4.3.1->ydata-profiling) (11.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2<3.2,>=2.11.1->ydata-profiling) (3.0.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (4.56.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (24.2)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (3.2.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=3.5->ydata-profiling) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas!=1.4.0,<3,>1.1->ydata-profiling) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas!=1.4.0,<3,>1.1->ydata-profiling) (2025.1)
Requirement already satisfied: joblib>=0.14.1 in /usr/local/lib/python3.11/dist-packages (from phik<0.13,>=0.11.1->ydata-profiling) (1.4.2)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2->ydata-profiling) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2->ydata-profiling) (2.27.2)
Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2->ydata-profiling) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (2025.1.31)
Requirement already satisfied: patsy>=0.5.6 in /usr/local/lib/python3.11/dist-packages (from statsmodels<1,>=0.13.2->ydata-profiling) (1.0.1)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.11/dist-packages (from visions<0.8.0,>=0.7.5->visions[type_image_path]<0.8.0,>=0.7.5->ydata-profiling) (25.1.0)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.11/dist-packages (from visions<0.8.0,>=0.7.5->visions[type_image_path]<0.8.0,>=0.7.5->ydata-profiling) (3.4.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.7->matplotlib>=3.5->ydata-profiling) (1.17.0)
Downloading ydata_profiling-4.12.2-py2.py3-none-any.whl (390 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 390.9/390.9 kB 12.1 MB/s eta 0:00:00
Downloading ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 296.5/296.5 kB 23.4 MB/s eta 0:00:00
Downloading dacite-1.9.2-py3-none-any.whl (16 kB)
Downloading multimethod-1.12-py3-none-any.whl (10 kB)
Downloading phik-0.12.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (687 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 687.8/687.8 kB 40.3 MB/s eta 0:00:00
Downloading visions-0.7.6-py3-none-any.whl (104 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 104.8/104.8 kB 9.0 MB/s eta 0:00:00
Downloading pywavelets-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 70.5 MB/s eta 0:00:00
Building wheels for collected packages: htmlmin
  Building wheel for htmlmin (setup.py) ... done
  Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27081 sha256=6898b6954b7ed52e06b339bd054264c83a85dfbaf98d834aedac3e3e6887ec85
  Stored in directory: /root/.cache/pip/wheels/8d/55/1a/19cd535375ed1ede0c996405ebffe34b196d78e2d9545723a2
Successfully built htmlmin
Installing collected packages: htmlmin, PyWavelets, multimethod, dacite, imagehash, visions, phik, ydata-profiling
Successfully installed PyWavelets-1.8.0 dacite-1.9.2 htmlmin-0.1.12 imagehash-4.3.1 multimethod-1.12 phik-0.12.4 visions-0.7.6 ydata-profiling-4.12.2

Loadind Data Set

In [ ]:
df = pd.read_csv("/content/BankChurners.csv")
In [ ]:
df.shape
Out[ ]:
(10127, 21)

Observation: Dataset has 10127 columns and 21 rows

In [ ]:
df.head()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [ ]:
df.tail()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189
In [ ]:
df.describe()
Out[ ]:
CLIENTNUM Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000
mean 739177606.334 46.326 2.346 35.928 3.813 2.341 2.455 8631.954 1162.814 7469.140 0.760 4404.086 64.859 0.712 0.275
std 36903783.450 8.017 1.299 7.986 1.554 1.011 1.106 9088.777 814.987 9090.685 0.219 3397.129 23.473 0.238 0.276
min 708082083.000 26.000 0.000 13.000 1.000 0.000 0.000 1438.300 0.000 3.000 0.000 510.000 10.000 0.000 0.000
25% 713036770.500 41.000 1.000 31.000 3.000 2.000 2.000 2555.000 359.000 1324.500 0.631 2155.500 45.000 0.582 0.023
50% 717926358.000 46.000 2.000 36.000 4.000 2.000 2.000 4549.000 1276.000 3474.000 0.736 3899.000 67.000 0.702 0.176
75% 773143533.000 52.000 3.000 40.000 5.000 3.000 3.000 11067.500 1784.000 9859.000 0.859 4741.000 81.000 0.818 0.503
max 828343083.000 73.000 5.000 56.000 6.000 6.000 6.000 34516.000 2517.000 34516.000 3.397 18484.000 139.000 3.714 0.999
In [ ]:
additional_droppable_columns = [
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
]

for col in additional_droppable_columns:
    if col in df.columns.unique().tolist():
        df.drop(columns=[col], inplace=True)
In [ ]:
data = df.copy()
In [ ]:
data.head()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [ ]:
data.tail()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Observation: It is onserved that education level and marital status is less then 10127

In [ ]:
data.duplicated().sum()
Out[ ]:
0

Observation: IT is observed that there is 0 duplicated values

In [ ]:
print("Missing Values Count per Column:")
print(df.isnull().sum())
Missing Values Count per Column:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Observation: It is observed that except education level and marital status all other values are zeros

Unique Categorial Variable

In [ ]:
data.select_dtypes(include="object").nunique()
Out[ ]:
0
Attrition_Flag 2
Gender 2
Education_Level 6
Marital_Status 3
Income_Category 6
Card_Category 4

Numerical unique variable

In [ ]:
data.select_dtypes(exclude="object").nunique()
Out[ ]:
0
CLIENTNUM 10127
Customer_Age 45
Dependent_count 6
Months_on_book 44
Total_Relationship_Count 6
Months_Inactive_12_mon 7
Contacts_Count_12_mon 7
Credit_Limit 6205
Total_Revolving_Bal 1974
Avg_Open_To_Buy 6813
Total_Amt_Chng_Q4_Q1 1158
Total_Trans_Amt 5033
Total_Trans_Ct 126
Total_Ct_Chng_Q4_Q1 830
Avg_Utilization_Ratio 964

Observation : unique value is of age =45 means customers are all of similar age

In [ ]:
data.describe()
Out[ ]:
CLIENTNUM Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000
mean 739177606.334 46.326 2.346 35.928 3.813 2.341 2.455 8631.954 1162.814 7469.140 0.760 4404.086 64.859 0.712 0.275
std 36903783.450 8.017 1.299 7.986 1.554 1.011 1.106 9088.777 814.987 9090.685 0.219 3397.129 23.473 0.238 0.276
min 708082083.000 26.000 0.000 13.000 1.000 0.000 0.000 1438.300 0.000 3.000 0.000 510.000 10.000 0.000 0.000
25% 713036770.500 41.000 1.000 31.000 3.000 2.000 2.000 2555.000 359.000 1324.500 0.631 2155.500 45.000 0.582 0.023
50% 717926358.000 46.000 2.000 36.000 4.000 2.000 2.000 4549.000 1276.000 3474.000 0.736 3899.000 67.000 0.702 0.176
75% 773143533.000 52.000 3.000 40.000 5.000 3.000 3.000 11067.500 1784.000 9859.000 0.859 4741.000 81.000 0.818 0.503
max 828343083.000 73.000 5.000 56.000 6.000 6.000 6.000 34516.000 2517.000 34516.000 3.397 18484.000 139.000 3.714 0.999

Observation:

Below observations are noted

Mean value for the Customer Age column is approx 46 and the median is also 45. This shows that majority of the customers are under 45 years of age. Dependent Count column has mean and median of ~2 Months on Book column has mean and median of 36 months. Minimum value is 13 months, showing that the dataset captures data for customers with the bank at least 1 whole years Total Relationship Count has mean and median of ~4 Credit Limit has a wide range of 1.4K to 34.5K, the median being 4.5K, way less than the mean 8.6K Total Transaction Count has mean of ~65 and median of 67

In [ ]:
data.describe(include='object')
Out[ ]:
Attrition_Flag Gender Education_Level Marital_Status Income_Category Card_Category
count 10127 10127 8608 9378 10127 10127
unique 2 2 6 3 6 4
top Existing Customer F Graduate Married Less than $40K Blue
freq 8500 5358 3128 4687 3561 9436
In [ ]:
def category_unique_value():
    for cat_cols in (
        data.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().to_list()
    ):
        print("Unique values and corresponding data counts for feature: " + cat_cols)
        print("-" * 90)
        df_temp = pd.concat(
            [
                data[cat_cols].value_counts(),
                data[cat_cols].value_counts(normalize=True) * 100,
            ],
            axis=1,
        )
        df_temp.columns = ["Count", "Percentage"]
        print(df_temp)
        print("-" * 90)
In [ ]:
category_unique_value()
Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
                   Count  Percentage
Attrition_Flag                      
Existing Customer   8500      83.934
Attrited Customer   1627      16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
        Count  Percentage
Gender                   
F        5358      52.908
M        4769      47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
                 Count  Percentage
Education_Level                   
Graduate          3128      36.338
High School       2013      23.385
Uneducated        1487      17.275
College           1013      11.768
Post-Graduate      516       5.994
Doctorate          451       5.239
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Marital_Status
------------------------------------------------------------------------------------------
                Count  Percentage
Marital_Status                   
Married          4687      49.979
Single           3943      42.045
Divorced          748       7.976
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Income_Category
------------------------------------------------------------------------------------------
                 Count  Percentage
Income_Category                   
Less than $40K    3561      35.163
$40K - $60K       1790      17.676
$80K - $120K      1535      15.157
$60K - $80K       1402      13.844
abc               1112      10.981
$120K +            727       7.179
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Card_Category
------------------------------------------------------------------------------------------
               Count  Percentage
Card_Category                   
Blue            9436      93.177
Silver           555       5.480
Gold             116       1.145
Platinum          20       0.197
------------------------------------------------------------------------------------------

Observation:

It is observed that 93% of customers are blue. 5% are silver and 1% are gold and les then 1% are platinum card holders

PRE EDA DATA PROCESSING

In [ ]:
data.drop(columns=["CLIENTNUM"],inplace=True)
In [ ]:
# Check the actual column names in your DataFrame
print(data.columns)

# Assuming there is a typo or whitespace issue, try the following:
# Ensure correct casing and spacing
marital_status_col = next((col for col in data.columns if col.lower().strip() == 'marital_status'), None)

if marital_status_col:
    data['Marital_status'] = data[marital_status_col].fillna('unknown')
else:
    print("Warning: 'Marital_status' column not found in the DataFrame.")
    # Investigate data loading/preprocessing steps if the column is truly missing
Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
       'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
       'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')
In [ ]:
data.loc[data[data["Income_Category"] == "abc"].index, "Income_Category"] = "Unknown"
In [ ]:
category_unique_value()
Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
                   Count  Percentage
Attrition_Flag                      
Existing Customer   8500      83.934
Attrited Customer   1627      16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
        Count  Percentage
Gender                   
F        5358      52.908
M        4769      47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
                 Count  Percentage
Education_Level                   
Graduate          3128      36.338
High School       2013      23.385
Uneducated        1487      17.275
College           1013      11.768
Post-Graduate      516       5.994
Doctorate          451       5.239
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Marital_Status
------------------------------------------------------------------------------------------
                Count  Percentage
Marital_Status                   
Married          4687      49.979
Single           3943      42.045
Divorced          748       7.976
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Income_Category
------------------------------------------------------------------------------------------
                 Count  Percentage
Income_Category                   
Less than $40K    3561      35.163
$40K - $60K       1790      17.676
$80K - $120K      1535      15.157
$60K - $80K       1402      13.844
Unknown           1112      10.981
$120K +            727       7.179
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Card_Category
------------------------------------------------------------------------------------------
               Count  Percentage
Card_Category                   
Blue            9436      93.177
Silver           555       5.480
Gold             116       1.145
Platinum          20       0.197
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Marital_status
------------------------------------------------------------------------------------------
                Count  Percentage
Marital_status                   
Married          4687      46.282
Single           3943      38.936
unknown           749       7.396
Divorced          748       7.386
------------------------------------------------------------------------------------------
In [ ]:
print(df.isnull())  # Boolean mask
print(df.isnull().sum())  # Count missing values per column
print(df.info())  # Summary including non-null counts
       CLIENTNUM  Attrition_Flag  Customer_Age  Gender  Dependent_count  \
0          False           False         False   False            False   
1          False           False         False   False            False   
2          False           False         False   False            False   
3          False           False         False   False            False   
4          False           False         False   False            False   
...          ...             ...           ...     ...              ...   
10122      False           False         False   False            False   
10123      False           False         False   False            False   
10124      False           False         False   False            False   
10125      False           False         False   False            False   
10126      False           False         False   False            False   

       Education_Level  Marital_Status  Income_Category  Card_Category  \
0                False           False            False          False   
1                False           False            False          False   
2                False           False            False          False   
3                False            True            False          False   
4                False           False            False          False   
...                ...             ...              ...            ...   
10122            False           False            False          False   
10123             True           False            False          False   
10124            False           False            False          False   
10125            False            True            False          False   
10126            False           False            False          False   

       Months_on_book  Total_Relationship_Count  Months_Inactive_12_mon  \
0               False                     False                   False   
1               False                     False                   False   
2               False                     False                   False   
3               False                     False                   False   
4               False                     False                   False   
...               ...                       ...                     ...   
10122           False                     False                   False   
10123           False                     False                   False   
10124           False                     False                   False   
10125           False                     False                   False   
10126           False                     False                   False   

       Contacts_Count_12_mon  Credit_Limit  Total_Revolving_Bal  \
0                      False         False                False   
1                      False         False                False   
2                      False         False                False   
3                      False         False                False   
4                      False         False                False   
...                      ...           ...                  ...   
10122                  False         False                False   
10123                  False         False                False   
10124                  False         False                False   
10125                  False         False                False   
10126                  False         False                False   

       Avg_Open_To_Buy  Total_Amt_Chng_Q4_Q1  Total_Trans_Amt  Total_Trans_Ct  \
0                False                 False            False           False   
1                False                 False            False           False   
2                False                 False            False           False   
3                False                 False            False           False   
4                False                 False            False           False   
...                ...                   ...              ...             ...   
10122            False                 False            False           False   
10123            False                 False            False           False   
10124            False                 False            False           False   
10125            False                 False            False           False   
10126            False                 False            False           False   

       Total_Ct_Chng_Q4_Q1  Avg_Utilization_Ratio  
0                    False                  False  
1                    False                  False  
2                    False                  False  
3                    False                  False  
4                    False                  False  
...                    ...                    ...  
10122                False                  False  
10123                False                  False  
10124                False                  False  
10125                False                  False  
10126                False                  False  

[10127 rows x 21 columns]
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
None
In [ ]:
df_null_summary = pd.concat(
    [data.isnull().sum(), data.isnull().sum() * 100 / data.isnull().count()], axis=1
)
df_null_summary.columns = ["Null Record Count", "Percentage of Null Records"]
df_null_summary[df_null_summary["Null Record Count"] > 0].sort_values(
    by="Percentage of Null Records", ascending=False
).style.background_gradient(cmap="YlOrRd")
Out[ ]:
  Null Record Count Percentage of Null Records
Education_Level 1519 14.999506
Marital_Status 749 7.396070
In [ ]:
category_columns = data.select_dtypes(include="object").columns.tolist()
In [ ]:
data[category_columns] = data[category_columns].astype("category")
In [ ]:
data.columns = [i.replace(" ", "_").lower() for i in data.columns]
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   attrition_flag            10127 non-null  category
 1   customer_age              10127 non-null  int64   
 2   gender                    10127 non-null  category
 3   dependent_count           10127 non-null  int64   
 4   education_level           8608 non-null   category
 5   marital_status            9378 non-null   category
 6   income_category           10127 non-null  category
 7   card_category             10127 non-null  category
 8   months_on_book            10127 non-null  int64   
 9   total_relationship_count  10127 non-null  int64   
 10  months_inactive_12_mon    10127 non-null  int64   
 11  contacts_count_12_mon     10127 non-null  int64   
 12  credit_limit              10127 non-null  float64 
 13  total_revolving_bal       10127 non-null  int64   
 14  avg_open_to_buy           10127 non-null  float64 
 15  total_amt_chng_q4_q1      10127 non-null  float64 
 16  total_trans_amt           10127 non-null  int64   
 17  total_trans_ct            10127 non-null  int64   
 18  total_ct_chng_q4_q1       10127 non-null  float64 
 19  avg_utilization_ratio     10127 non-null  float64 
 20  marital_status            10127 non-null  category
dtypes: category(7), float64(5), int64(9)
memory usage: 1.2 MB
In [ ]:
# Instead of using numerical indices, directly specify the columns to drop by name.
# Assuming you intend to drop the last two columns:

data.drop(columns=data.columns[-2:], inplace=True)
# data.columns[-2:] selects the last two column names.

data.head(2)
Out[ ]:
attrition_flag customer_age gender dependent_count education_level income_category card_category months_on_book total_relationship_count months_inactive_12_mon contacts_count_12_mon credit_limit total_revolving_bal avg_open_to_buy total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1
0 Existing Customer 45 M 3 High School $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625
1 Existing Customer 49 F 5 Graduate Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714

Exploratory Data Analysis

Univariate Analysis

In [ ]:
summary(data, "customer_age")
5 Point Summary of Customer_age Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    26 |   41 |   46 |   52 |    73 |
+-------+-------+------+------+------+-------+
<ipython-input-36-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-36-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-36-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(

Observation: It is observed thayt data is uniformly distributed

In [ ]:
summary(data, "dependent_count")
5 Point Summary of Dependent_count Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    3 |     5 |
+-------+-------+------+------+------+-------+
<ipython-input-36-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-36-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-36-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(

Observation: It is observed that dependent count is 2 and 3

In [ ]:
def summary(data: pd.DataFrame, x: str):
    """
    The function prints the 5 point summary and histogram, box plot,
    violin plot, and cumulative density distribution plots for each
    feature name passed as the argument.

    Parameters:
    ----------

    x: str, feature name

    Usage:
    ------------

    summary('age')
    """
    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)

    dict = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}
    df = pd.DataFrame(data=dict, index=["Value"])
    print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
    # Assuming 'tabulate' is available, if not, install it using: !pip install tabulate
    from tabulate import tabulate
    print(tabulate(df, headers="keys", tablefmt="psql"))

    fig = plt.figure(figsize=(16, 8))
    plt.subplots_adjust(hspace=0.6)
    sns.set_palette("Pastel1")
    # Corrected indentation here:
    plt.subplot(221, frameon=True)
    ax1 = sns.distplot(data[x], color="purple")
    ax1.axvline(
        np.mean(data[x]), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax1.axvline(
        np.median(data[x]), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.title(f"{x.capitalize()} Density Distribution")

    plt.subplot(222, frameon=True)
    ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
    plt.title(f"{x.capitalize()} Violinplot")

    plt.subplot(223, frameon=True, sharex=ax1)
    ax3 = sns.boxplot(
        x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True
    )
    plt.title(f"{x.capitalize()} Boxplot")

    plt.subplot(224, frameon=True, sharex=ax2)
    ax4 = sns.kdeplot(data[x], cumulative=True)
    plt.title(f"{x.capitalize()} Cumulative Density Distribution")

    plt.show()
In [ ]:
data['gender'].value_counts()
sns.countplot(data=df, x='Gender')
plt.pie(df['Gender'].value_counts(), labels = ['Female', 'Male'], autopct='%1.1f%%', shadow = True, startangle = 90)
plt.title('Proportion of Gender count', fontsize = 16)
plt.show()

Observation:

It is observed that gender distribution is normal

In [ ]:
summary(data, "total_relationship_count")
5 Point Summary of Total_relationship_count Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     1 |    3 |    4 |    5 |     6 |
+-------+-------+------+------+------+-------+
<ipython-input-225-d0d21b6c0e3c>:35: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-225-d0d21b6c0e3c>:45: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-225-d0d21b6c0e3c>:49: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(

Observation:

It is observed that customers have 4 or more relations with the bank

In [ ]:
import matplotlib.pyplot as plt # Import the matplotlib.pyplot module
plt.pie(data['attrition_flag'].value_counts(), labels = ['Existing Customer', 'Attrited Customer'],
        autopct='%1.1f%%', startangle = 90)
plt.title('Proportion of Existing and Attrited Customer count', fontsize = 16)
plt.show()
In [ ]:
edu = data['education_level'].value_counts().to_frame('Counts')
plt.figure(figsize = (8,8))
# Use edu.index for x-axis and edu['Counts'] for y-axis
plt.plot(edu.index, edu['Counts'], marker='o')  # Added marker for better visualization
plt.title('Proportion of Education Levels', fontsize = 18)
plt.xlabel('Education Level')  # Added x-axis label
plt.ylabel('Counts')  # Added y-axis label
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.show()
In [ ]:
plt.figure(figsize=(10,6))
sns.countplot(x='Attrition_Flag', hue='Marital_Status', data=df)
plt.title('Attrited and Existing Customers by Marital Status', fontsize=20)
Out[ ]:
Text(0.5, 1.0, 'Attrited and Existing Customers by Marital Status')

Observation: ' It is observed that Married people are more in the customers list

In [ ]:
def summary(data: pd.DataFrame, x: str):
    """
    The function prints the 5 point summary and histogram, box plot,
    violin plot, and cumulative density distribution plots for each
    feature name passed as the argument.

    Parameters:
    ----------

    x: str, feature name

    Usage:
    ------------

    summary('age')
    """
    # Convert the column name to lowercase to handle case sensitivity
    x = x.lower()

    # Check if the column exists in the DataFrame
    if x not in data.columns:
        print(f"Error: Column '{x}' not found in the DataFrame.")
        return  # Exit the function if the column is not found

    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)

    dict = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}
    df = pd.DataFrame(data=dict, index=["Value"])
    print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
    # Assuming 'tabulate' is available, if not, install it using: !pip install tabulate
    from tabulate import tabulate
    print(tabulate(df, headers="keys", tablefmt="psql"))

    fig = plt.figure(figsize=(16, 8))
    plt.subplots_adjust(hspace=0.6)
    sns.set_palette("Pastel1")
    # Corrected indentation here:
    plt.subplot(221, frameon=True)
    ax1 = sns.distplot(data[x], color="purple")
    ax1.axvline(
        np.mean(data[x]), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax1.axvline(
        np.median(data[x]), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.title(f"{x.capitalize()} Density Distribution")

    plt.subplot(222, frameon=True)
    ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
    plt.title(f"{x.capitalize()} Violinplot")

    plt.subplot(223, frameon=True, sharex=ax1)
    ax3 = sns.boxplot(
        x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True
    )
    plt.title(f"{x.capitalize()} Boxplot")

    plt.subplot(224, frameon=True, sharex=ax2)
    ax4 = sns.kdeplot(data[x], cumulative=True)
    plt.title(f"{x.capitalize()} Cumulative Density Distribution")

    plt.show()
In [ ]:
summary(data, "Credit_Limit")
5 Point Summary of Credit_limit Attribute:

+-------+--------+------+------+---------+-------+
|       |    Min |   Q1 |   Q2 |      Q3 |   Max |
|-------+--------+------+------+---------+-------|
| Value | 1438.3 | 2555 | 4549 | 11067.5 | 34516 |
+-------+--------+------+------+---------+-------+
<ipython-input-233-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-233-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-233-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(

Observation:

It is observed that There are higher end outliers in Credit Limit. This might be because the customers are high end.

In [ ]:
data[data["credit_limit"] > 23000]["income_category"].value_counts(normalize=True)
Out[ ]:
proportion
income_category
$80K - $120K 0.421
$120K + 0.302
$60K - $80K 0.156
Unknown 0.110
$40K - $60K 0.012
Less than $40K 0.000

In [ ]:
data[data["credit_limit"] > 23000]["card_category"].value_counts(normalize=True)
Out[ ]:
proportion
card_category
Blue 0.592
Silver 0.310
Gold 0.083
Platinum 0.015

Observation:

It is observed that 83% have gold card

In [ ]:
summary(data, "total_revolving_bal")
5 Point Summary of Total_revolving_bal Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |  359 | 1276 | 1784 |  2517 |
+-------+-------+------+------+------+-------+
<ipython-input-233-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-233-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-233-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(

Observation:

It is observed that Data is right skewed

In [ ]:
summary(data, "total_amt_chng_q4_q1")
5 Point Summary of Total_amt_chng_q4_q1 Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.631 | 0.736 | 0.859 | 3.397 |
+-------+-------+-------+-------+-------+-------+
<ipython-input-233-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-233-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-233-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(

Observation:

It is observed that Outliers are found on both the side

In [ ]:
summary(data, "total_trans_amt")
5 Point Summary of Total_trans_amt Attribute:

+-------+-------+--------+------+------+-------+
|       |   Min |     Q1 |   Q2 |   Q3 |   Max |
|-------+-------+--------+------+------+-------|
| Value |   510 | 2155.5 | 3899 | 4741 | 18484 |
+-------+-------+--------+------+------+-------+
<ipython-input-233-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-233-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-233-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(
In [ ]:
summary(data, "total_trans_ct")
5 Point Summary of Total_trans_ct Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    10 |   45 |   67 |   81 |   139 |
+-------+-------+------+------+------+-------+
<ipython-input-233-b0c975a7f852>:43: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  ax1 = sns.distplot(data[x], color="purple")
<ipython-input-233-b0c975a7f852>:53: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
<ipython-input-233-b0c975a7f852>:57: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax3 = sns.boxplot(
In [ ]:
def perc_on_bar(data: pd.DataFrame, cat_columns, target, hue=None, perc=True):
    '''
    The function takes a category column as input and plots bar chart with percentages on top of each bar

    Usage:
    ------

    perc_on_bar(df, ['age'], 'prodtaken')
    '''

    subplot_cols = 2
    subplot_rows = int(len(cat_columns)/2 + 1)
    plt.figure(figsize=(16,3*subplot_rows))
    for i, col in enumerate(cat_columns):
        plt.subplot(subplot_rows,subplot_cols,i+1)
        order = data[col].value_counts(ascending=False).index  # Data order
        ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
        for p in ax.patches:
            percentage = '{:.1f}%\n({})'.format(100 * p.get_height()/len(data[target]), p.get_height())
            # Added percentage and actual value
            x = p.get_x() + p.get_width() / 2
            y = p.get_y() + p.get_height() + 40
            if perc:
                plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium'); # Annotation on top of bars
        plt.xticks(color='black', fontsize='medium', rotation= (-90 if col=='region' else 0));
        plt.tight_layout()
        plt.title(col.capitalize() + ' Percentage Bar Charts\n\n') # Moved out of inner loop
In [ ]:
category_columns = data.select_dtypes(include="category").columns.tolist()
target_variable = "attrition_flag"
perc_on_bar(data, category_columns, target_variable)
<ipython-input-241-499909de2952>:17: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
<ipython-input-241-499909de2952>:17: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
<ipython-input-241-499909de2952>:17: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
<ipython-input-241-499909de2952>:17: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
<ipython-input-241-499909de2952>:17: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);

Observation:

It is observed the below points

1.High Imbalance in data 2.Data is almost equally distributed between Males and Females 3.31% customers are Graduate 4.85% customers are either Single or Married, where 46.7% of the customers are Married 5.35% customers earn less than $40k and 36% earns $60k or more 6.93% customers have Blue card

Bi-variate Analysis

In [ ]:
def box_by_target(data: pd.DataFrame, numeric_columns, target, include_outliers):
    """
    The function takes a category column, target column, and whether to include outliers or not as input
    and plots bar chart with percentages on top of each bar

    Usage:
    ------

    perc_on_bar(['age'], 'prodtaken', True)
    """
    subplot_cols = 2
    subplot_rows = int(len(numeric_columns) / 2 + 1)
    plt.figure(figsize=(16, 3 * subplot_rows))
    for i, col in enumerate(numeric_columns):
        plt.subplot(8, 2, i + 1)
        sns.boxplot(
            data=data,
            x=target,
            y=col,
            orient="vertical",
                      palette="Blues",
            showfliers=include_outliers,
        )
        plt.tight_layout()
        plt.title(str(i + 1) + ": " + target + " vs. " + col, color="black")
In [ ]:
numeric_columns = data.select_dtypes(exclude="category").columns.tolist()
target_variable = "attrition_flag"
box_by_target(data, numeric_columns, target_variable, True)
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(

Observation:

The above graphs are considered with outliers

In [ ]:
box_by_target(data, numeric_columns, target_variable, False)
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(
<ipython-input-243-4ee4fe0a5510>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(

Observation:

The above graph is without outliers

In [ ]:
def cat_view(df: pd.DataFrame, x, target):
    """
    Function to create a Bar chart and a Pie chart for categorical variables.
    """
    from matplotlib import cm

    color1 = cm.inferno(np.linspace(0.4, 0.8, 30))
    color2 = cm.viridis(np.linspace(0.4, 0.8, 30))
    sns.set_palette("cubehelix")
    fig, ax = plt.subplots(1, 2, figsize=(16, 4))

    """
    Draw a Pie Chart on first subplot.
    """
    s = data.groupby(x).size()

    mydata_values = s.values.tolist()
    mydata_index = s.index.tolist()

    def func(pct, allvals):
        absolute = int(pct / 100.0 * np.sum(allvals))
        return "{:.1f}%\n({:d})".format(pct, absolute)

    wedges, texts, autotexts = ax[0].pie(
        mydata_values,
        autopct=lambda pct: func(pct, mydata_values),
        textprops=dict(color="w"),
    )

    ax[0].legend(
        wedges,
        mydata_index,
        title=x.capitalize(),
        loc="center left",
        bbox_to_anchor=(1, 0, 0.5, 1),
    )

    plt.setp(autotexts, size=12)

    ax[0].set_title(f"{x.capitalize()} Pie Chart")

    """
    Draw a Bar Graph on second subplot.
    """

    df = pd.pivot_table(
        data, index=[x], columns=[target], values=["credit_limit"], aggfunc=len
    )

    labels = df.index.tolist()
    no = df.values[:, 1].tolist()
    yes = df.values[:, 0].tolist()

    l = np.arange(len(labels))  # the label locations
    width = 0.35  # the width of the bars

    rects1 = ax[1].bar(
        l - width / 2, no, width, label="Existing Customer", color=color1
    )
    rects2 = ax[1].bar(
        l + width / 2, yes, width, label="Attrited Customer", color=color2
    )

    # Add some text for labels, title and custom x-axis tick labels, etc.
    ax[1].set_ylabel("Scores")
    ax[1].set_title(f"{x.capitalize()} Bar Graph")
    ax[1].set_xticks(l)
    ax[1].set_xticklabels(labels)
    ax[1].legend()

    def autolabel(rects):

        """Attach a text label above each bar in *rects*, displaying its height."""

        for rect in rects:
            height = rect.get_height()
            ax[1].annotate(
                "{}".format(height),
                xy=(rect.get_x() + rect.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                fontsize="medium",
                ha="center",
                va="bottom",
            )

    autolabel(rects1)
    autolabel(rects2)
    fig.tight_layout()
    plt.show()

    """
    Draw a Stacked Bar Graph on bottom.
    """

    sns.set(palette="tab10")
    tab = pd.crosstab(data[x], data[target], normalize="index")

    tab.plot.bar(stacked=True, figsize=(16, 3))
    plt.title(x.capitalize() + " Stacked Bar Plot")
    plt.legend(loc="upper right", bbox_to_anchor=(0, 1))
    plt.show()
In [ ]:
cat_view(data, "gender", "attrition_flag")
<ipython-input-246-7fc8ed6354d3>:15: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  s = data.groupby(x).size()
<ipython-input-246-7fc8ed6354d3>:46: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  df = pd.pivot_table(

Observation:

It is observed that Attrition and gender are not related to each othert

In [ ]:
cat_view(data, "education_level", "attrition_flag")
<ipython-input-246-7fc8ed6354d3>:15: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  s = data.groupby(x).size()
<ipython-input-246-7fc8ed6354d3>:46: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  df = pd.pivot_table(

Education and attrition are not related to each other is the observation

In [ ]:
cat_view(data, "income_category", "attrition_flag")
<ipython-input-43-7fc8ed6354d3>:15: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  s = data.groupby(x).size()
<ipython-input-43-7fc8ed6354d3>:46: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  df = pd.pivot_table(
In [ ]:
cat_view(data, "card_category", "attrition_flag")
<ipython-input-43-7fc8ed6354d3>:15: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  s = data.groupby(x).size()
<ipython-input-43-7fc8ed6354d3>:46: FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior
  df = pd.pivot_table(
In [ ]:
f, ax = plt.subplots(figsize=(12, 8))
# Include only numerical features for correlation calculation
numerical_data = data.select_dtypes(include=np.number)
sns.heatmap(numerical_data.corr(), annot=True, cmap="Blues")
plt.show()

Observation:

The above graph gives correlation factors

In [ ]:
def feature_name_standardize(df: pd.DataFrame):
    df_ = df.copy()
    df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
    return df_

# Building a function to drop features

def drop_feature(df: pd.DataFrame, features: list = []):
    df_ = df.copy()
    if len(features) != 0:
        df_ = df_.drop(columns=features)

    return df_

def mask_value(df: pd.DataFrame, feature: str = None, value_to_mask: str = None, masked_value: str = None):
    data = df.copy() # This line and the subsequent lines within this function should be indented
    if feature != None and value_to_mask != None:
        if feature in df_.columns:
            df_[feature] = df_[feature].astype('object')
            df_.loc[df_[df_[feature] == value_to_mask].index, feature] = masked_value
            df_[feature] = df_[feature].astype('category')

    return df_

# Building a custom imputer

def impute_category_unknown(df: pd.DataFrame, fill_value: str):
    df_ = df.copy()
    for col in df_.select_dtypes(include='category').columns.tolist():
        df_[col] = df_[col].astype('object')
        df_[col] = df_[col].fillna('Unknown')
        df_[col] = df_[col].astype('category')
In [ ]:
df = data.copy()
df.describe(include="all").T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
attrition_flag 10127 2 Existing Customer 8500 NaN NaN NaN NaN NaN NaN NaN
customer_age 10127.000 NaN NaN NaN 46.326 8.017 26.000 41.000 46.000 52.000 73.000
gender 10127 2 F 5358 NaN NaN NaN NaN NaN NaN NaN
dependent_count 10127.000 NaN NaN NaN 2.346 1.299 0.000 1.000 2.000 3.000 5.000
education_level 8608 6 Graduate 3128 NaN NaN NaN NaN NaN NaN NaN
income_category 10127 6 Less than $40K 3561 NaN NaN NaN NaN NaN NaN NaN
card_category 10127 4 Blue 9436 NaN NaN NaN NaN NaN NaN NaN
months_on_book 10127.000 NaN NaN NaN 35.928 7.986 13.000 31.000 36.000 40.000 56.000
total_relationship_count 10127.000 NaN NaN NaN 3.813 1.554 1.000 3.000 4.000 5.000 6.000
months_inactive_12_mon 10127.000 NaN NaN NaN 2.341 1.011 0.000 2.000 2.000 3.000 6.000
contacts_count_12_mon 10127.000 NaN NaN NaN 2.455 1.106 0.000 2.000 2.000 3.000 6.000
credit_limit 10127.000 NaN NaN NaN 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
total_revolving_bal 10127.000 NaN NaN NaN 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
avg_open_to_buy 10127.000 NaN NaN NaN 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
total_amt_chng_q4_q1 10127.000 NaN NaN NaN 0.760 0.219 0.000 0.631 0.736 0.859 3.397
total_trans_amt 10127.000 NaN NaN NaN 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
total_trans_ct 10127.000 NaN NaN NaN 64.859 23.473 10.000 45.000 67.000 81.000 139.000
total_ct_chng_q4_q1 10127.000 NaN NaN NaN 0.712 0.238 0.000 0.582 0.702 0.818 3.714
In [ ]:
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
    "marital_status",
]
In [ ]:
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"

# Random state and loss
seed = 1
loss_func = "logloss"

# Test and Validation sizes
test_size = 0.2
val_size = 0.25

# Dependent Varibale Value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
In [ ]:
cat_columns = df.select_dtypes(include="object").columns.tolist()
df[cat_columns] = df[cat_columns].astype("category")
In [ ]:
X = data.drop(columns=["attrition_flag"]) # Changed "Attrition_Flag" to "attrition_flag"
y = data["attrition_flag"].map(target_mapper) # Changed "Attrition_Flag" to "attrition_flag"
In [ ]:
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(
    "Training data shape: \n\n",
    X_train.shape,
    "\n\nValidation Data Shape: \n\n",
    X_val.shape,
    "\n\nTesting Data Shape: \n\n",
    X_test.shape,
)
Training data shape: 

 (6075, 17) 

Validation Data Shape: 

 (2026, 17) 

Testing Data Shape: 

 (2026, 17)
In [ ]:
print("Training: \n", y_train.value_counts(normalize=True))
print("\n\nValidation: \n", y_val.value_counts(normalize=True))
print("\n\nTest: \n", y_test.value_counts(normalize=True))
Training: 
 attrition_flag
0   0.839
1   0.161
Name: proportion, dtype: float64


Validation: 
 attrition_flag
0   0.839
1   0.161
Name: proportion, dtype: float64


Test: 
 attrition_flag
0   0.840
1   0.160
Name: proportion, dtype: float64

Data processing

In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if X is not None:  # Check if X is not None
            X_ = X.copy()
            X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
            return X_  # Return the modified DataFrame
        else:
            return None  # Or raise an exception if None is unexpected # Handle the case where X is None
In [ ]:
class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
        return X_  # Return the modified DataFrame X_
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if X is not None:  # Check if X is not None
            X_ = X.copy()
            X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
            return X_  # Return the modified DataFrame
        else:
            return None  # Or raise an exception if None is unexpected
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
        return X_  # Return the modified DataFrame
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.features, errors='ignore')  # errors='ignore' to handle cases where a feature might
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
        return X_  # Return the modified DataFrame X_
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureNamesStandardizer(BaseEstimator, TransformerMixin):
    """
    A transformer to standardize feature names:
        - Replaces spaces with underscores.
        - Converts to lowercase.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_.columns = [i.replace(" ", "_").lower() for i in X_.columns]
        return X_  # Return the modified DataFrame X_
In [ ]:
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()

# Assuming 'X' and 'y' are your features and target variable
# Split your data using train_test_split BEFORE the check for None values
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed, stratify=y
)

# Split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)

# Check if X_train, X_val, X_test are None before proceeding
# This check should now pass as the variables have been assigned values
if X_train is None or X_val is None or X_test is None:
    raise ValueError("X_train, X_val, or X_test are None. Check your data loading and splitting.")

# Proceed with your data processing steps

X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)
X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

X_train = column_dropper.fit_transform(X_train)
X_val = column_dropper.transform(X_val)
X_test = column_dropper.transform(X_test)
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomValueMasker(BaseEstimator, TransformerMixin):
    def __init__(self, feature, value_to_mask, masked_value):
        self.feature = feature
        self.value_to_mask = value_to_mask
        self.masked_value = masked_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
from sklearn.base import BaseEstimator, TransformerMixin

class CustomValueMasker(BaseEstimator, TransformerMixin):
    def __init__(self, feature, value_to_mask, masked_value):
        self.feature = feature
        self.value_to_mask = value_to_mask
        self.masked_value = masked_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        # Check if the feature exists in the DataFrame before masking
        if self.feature in X_.columns:
            X_[self.feature] = X_[self.feature].replace(self.value_to_mask, self.masked_value)
        return X_
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FillUnknown(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        for col in X_.columns:
            # Access dtype of the Series using .dtypes
            if X_[col].dtypes.name == 'category':
                X_[col] = X_[col].cat.add_categories('Unknown')
                X_[col] = X_[col].fillna('Unknown')
        return X_
In [ ]:
robust_scaler = RobustScaler(with_centering=False, with_scaling=True)
num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",

]
In [ ]:
X_train[num_columns] = pd.DataFrame(
    robust_scaler.fit_transform(X_train[num_columns]),
    columns=num_columns,
    index=X_train.index,
)
X_val[num_columns] = pd.DataFrame(
    robust_scaler.transform(X_val[num_columns]), columns=num_columns, index=X_val.index
)
X_test[num_columns] = pd.DataFrame(
    robust_scaler.transform(X_test[num_columns]),
    columns=num_columns,
    index=X_test.index,
)
In [ ]:
print(X_train.columns)
print(X_val.columns)
print(X_test.columns)
Index(['gender', 'education_level', 'income_category', 'card_category',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1',
       'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
Index(['gender', 'education_level', 'income_category', 'card_category',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1',
       'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
Index(['gender', 'education_level', 'income_category', 'card_category',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1',
       'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
X_train.head(3)
Out[ ]:
gender education_level marital_status income_category card_category total_relationship_count months_inactive_12_mon contacts_count_12_mon total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1 avg_utilization_ratio marital_status
800 M NaN Single $120K + Blue 3.000 4.000 3.000 1.226 2.044 0.648 1.278 2.249 0.168 Single
498 M NaN Married abc Blue 3.000 2.000 0.000 1.450 1.697 0.524 0.861 2.667 1.376 Married
4356 M High School Married $80K - $120K Blue 2.500 1.000 2.000 1.926 3.829 1.661 2.194 3.717 0.775 Married
In [ ]:
X_val.head(3)
Out[ ]:
gender education_level marital_status income_category card_category total_relationship_count months_inactive_12_mon contacts_count_12_mon total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1 avg_utilization_ratio marital_status
2894 M Post-Graduate Single $80K - $120K Blue 2.500 2.000 3.000 0.000 5.083 1.148 1.528 4.068 0.000 Single
9158 M Uneducated Single $80K - $120K Blue 0.500 3.000 1.000 0.000 3.982 3.148 1.639 3.810 0.000 Single
9618 M Uneducated Married $120K + Platinum 1.500 4.000 3.000 1.584 3.860 5.291 2.833 2.300 0.126 Married
In [ ]:
X_test.head(3)
Out[ ]:
gender education_level marital_status income_category card_category total_relationship_count months_inactive_12_mon contacts_count_12_mon total_revolving_bal total_amt_chng_q4_q1 total_trans_amt total_trans_ct total_ct_chng_q4_q1 avg_utilization_ratio marital_status
9760 M High School Single $80K - $120K Blue 1.000 3.000 2.000 0.865 3.316 5.556 2.583 2.544 0.369 Single
7413 M Post-Graduate Single $60K - $80K Blue 2.000 3.000 2.000 0.000 3.219 0.850 1.139 2.190 0.000 Single
6074 F High School Married $40K - $60K Blue 1.500 3.000 3.000 0.000 3.237 1.658 2.056 3.215 0.000 Married
In [ ]:
print(
    "Training data shape: \n\n",
    X_train.shape,
    "\n\nValidation Data Shape: \n\n",
    X_val.shape,
    "\n\nTesting Data Shape: \n\n",
    X_test.shape,
)
Training data shape: 

 (6075, 15) 

Validation Data Shape: 

 (2026, 15) 

Testing Data Shape: 

 (2026, 15)

Model Building Considerations¶

Model evaluation criterion:

Model can make wrong predictions as: Predicting a customer will attrite and the customer does not attrite - Loss of resources Predicting a customer will not attrite and the customer attrites - Loss of opportunity for churning the customer Which case is more important? Predicting that customer will not attrite, but actually attrites, would result in loss for the bank since if predicted correctly,marketing/sales team could have contacted the customer to retain them. This would result in losses. So, the false negatives should be minimized. How to reduce this loss i.e need to reduce False Negatives? Company wants Recall to be maximized, greater the Recall lesser the chances of false negatives. Let's start by building different models using KFold and cross_val_score and tune the best model using RandomizedSearchCV

Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. Split dataset into k consecutive folds (without shuffling by default) keeping the distribution of both classes in each fold the same as the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.

In [ ]:
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
In [ ]:
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
    score_list = [] # This line was outside the function, causing issues

    # Indent the following lines to be part of the function body
    pred_train = model.predict_proba(train)[:, 1] > threshold
    pred_test = model.predict_proba(test)[:, 1] > threshold

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(pred_train, train_y)
    test_acc = accuracy_score(pred_test, test_y)

    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
    score_list = []
    # Indent the following lines to be part of the function body
    pred_train = model.predict_proba(train)[:, 1] > threshold
    pred_test = model.predict_proba(test)[:, 1] > threshold

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(pred_train, train_y)
    test_acc = accuracy_score(pred_test, test_y)

    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)

    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)
In [ ]:
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
    # Initialize score_list within the function
    score_list = []

    pred_train = model.predict_proba(train)[:, 1] > threshold
    pred_test = model.predict_proba(test)[:, 1] > threshold

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(pred_train, train_y)
    test_acc = accuracy_score(pred_test, test_y)

    train_recall
In [ ]:
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
    # Initialize score_list within the function
    score_list = []

    pred_train = (model.predict_proba(train)[:, 1] > threshold)
    pred_test = (model.predict_proba(test)[:, 1] > threshold)

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)

    train_recall = recall_score(train_y, pred_train)  # Calculate train_recall
    test_recall = recall_score(test_y, pred_test)  # Calculate test_recall

    train_precision = precision_score(train_y, pred_train) # Calculate train_precision
    test_precision = precision_score(test_y, pred_test)  # Calculate test_precision

    train_f1 = f1_score(train_y, pred_train) # Calculate train_f1
    test_f1 = f1_score(test_y, pred_test)  # Calculate test_f1

    # Add the calculated values to the score_list
    # ... (rest of the function implementation to append scores to score_list) ...

    return score_list # Correct indentation for return statement

Function for Confusion Matrix

In [ ]:
def make_confusion_matrix(model, test_X, y_actual, labels=[1, 0]):
    """
    model : classifier to predict values of X
    test_X: test set
    y_actual : ground truth

    """
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(
        cm,
        index=[i for i in ["Actual - Attrited", "Actual - Existing"]],
        columns=[i for i in ["Predicted - Attrited", "Predicted - Existing"]],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(5, 3))
    sns.heatmap(df_cm, annot=labels, fmt="", cmap="Blues").set(title="Confusion Matrix")

Add scores to score list

In [ ]:
model_names = []
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []
In [ ]:
def add_score_model(model_name, score, cv_res):
    """Add scores to list so that we can compare all models score together"""
    model_names.append(model_name)
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])
    roc_auc_train.append(score[8])
    roc_auc_test.append(score[9])
    cross_val_train.append(cv_res)

Building Models¶ We are building 8 models here, Logistic Regression, Bagging, Random Forest, Gradient Boosting, Ada Boosting, Extreme Gradient Boosting, Decision Tree, and Light Gradient Boosting.

Build and Train Models We are building below 5 models:

Bagging Random Forest Classification Gradient Boosting Machine Adaptive Boosting eXtreme Gradient Boosting

In [ ]:
models = []  # Empty list to store all the models
cv_results = []

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=seed)))
models.append(("Random forest", RandomForestClassifier(random_state=seed)))
models.append(("GBM", GradientBoostingClassifier(random_state=seed)))
models.append(("Adaboost", AdaBoostClassifier(random_state=seed)))
models.append(("Xgboost", XGBClassifier(random_state=seed, eval_metric=loss_func)))
models.append(("dtree", DecisionTreeClassifier(random_state=seed)))
models.append(("Light GBM", lgb.LGBMClassifier(random_state=seed)))
In [ ]:
print(X_train.dtypes)  # Identify non-numeric columns
print(X_train.select_dtypes(include=['object']).head())
gender                      category
education_level             category
income_category             category
card_category               category
total_relationship_count     float64
months_inactive_12_mon       float64
contacts_count_12_mon        float64
total_revolving_bal          float64
total_amt_chng_q4_q1         float64
total_trans_amt              float64
total_trans_ct               float64
total_ct_chng_q4_q1          float64
dtype: object
Empty DataFrame
Columns: []
Index: [800, 498, 4356, 407, 8728]
In [ ]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
In [ ]:
categorical_cols = X_train.select_dtypes(include=['category']).columns
print("Categorical Columns:", categorical_cols)
Categorical Columns: Index(['gender', 'education_level', 'income_category', 'card_category'], dtype='object')
In [ ]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['category']).columns
In [ ]:
print(X_train.dtypes)  # Should only contain numerical data (int or float)
print(X_val.dtypes)
print(X_train.shape, X_val.shape)
gender                      category
education_level             category
income_category             category
card_category               category
total_relationship_count     float64
months_inactive_12_mon       float64
contacts_count_12_mon        float64
total_revolving_bal          float64
total_amt_chng_q4_q1         float64
total_trans_amt              float64
total_trans_ct               float64
total_ct_chng_q4_q1          float64
dtype: object
gender                      category
education_level             category
income_category             category
card_category               category
total_relationship_count     float64
months_inactive_12_mon       float64
contacts_count_12_mon        float64
total_revolving_bal          float64
total_amt_chng_q4_q1         float64
total_trans_amt              float64
total_trans_ct               float64
total_ct_chng_q4_q1          float64
dtype: object
(6075, 12) (2026, 12)
In [ ]:
print(X_train.dtypes)  # Check data types
print(X_train.select_dtypes(include=['object', 'category']).head())  # Show categorical data
gender                      category
education_level             category
marital_status              category
income_category             category
card_category               category
total_relationship_count     float64
months_inactive_12_mon       float64
contacts_count_12_mon        float64
total_revolving_bal          float64
total_amt_chng_q4_q1         float64
total_trans_amt              float64
total_trans_ct               float64
total_ct_chng_q4_q1          float64
avg_utilization_ratio        float64
marital_status              category
dtype: object
     gender education_level marital_status income_category card_category  \
800       M             NaN         Single         $120K +          Blue   
498       M             NaN        Married             abc          Blue   
4356      M     High School        Married    $80K - $120K          Blue   
407       M        Graduate            NaN     $60K - $80K        Silver   
8728      M     High School       Divorced     $40K - $60K        Silver   

     marital_status  
800          Single  
498         Married  
4356        Married  
407         unknown  
8728       Divorced  
In [ ]:
import pandas as pd

X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_val_encoded = pd.get_dummies(X_val, drop_first=True)

# Ensure both have same columns after encoding
X_train_encoded, X_val_encoded = X_train_encoded.align(X_val_encoded, join='left', axis=1, fill_value=0)
In [ ]:
from sklearn.preprocessing import LabelEncoder

# Reset index in case it was causing issues
X_train = X_train.reset_index(drop=True)
X_val = X_val.reset_index(drop=True)

label_encoders = {}
for col in X_train.select_dtypes(include=['object', 'category']).columns:
    # Ensure the column is a Series before applying LabelEncoder
    if X_train[col].ndim == 1:  # Check if the column is 1-dimensional
        le = LabelEncoder()
        X_train[col] = le.fit_transform(X_train[col])
        X_val[col] = le.transform(X_val[col])
        label_encoders[col] = le
    else:
        print(f"Warning: Skipping column '{col}' as it is not 1-dimensional.")
In [ ]:
def get_metrics_score(model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: thresold for classifiying the observation as 1
    flag: If the flag is set to True then only the print statements showing different will be displayed. The default value is set to True.
    roc: If the roc is set to True then only roc score will be displayed. The default value is set to False.
    """
    # Initialize score_list within the function
    score_list = []

    pred_train = (model.predict_proba(train)[:, 1] > threshold)
    pred_test = (model.predict_proba(test)[:, 1] > threshold)

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)

    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)

    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)

    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)

    # Append the calculated metric scores to score_list
    score_list.extend([train_acc, test_acc, train_recall, test_recall, train_precision, test_precision, train_f1, test_f1])

    return score_list # Correct indentation for return statement

    print("operation completed")
In [ ]:
print("Score Output:", model_score)  # Debugging step
print("Length of Score:", len(model_score))
Score Output: [0.9963786008230453, 0.9580454096742349, 0.9815573770491803, 0.8374233128834356, 0.9958419958419958, 0.8950819672131147, 0.9886480908152735, 0.8652931854199684]
Length of Score: 8
In [ ]:
def get_metrics_score(model, X_train, X_val, y_train, y_val):
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

    # Predictions
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)

    # Calculate metrics
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_val = accuracy_score(y_val, y_val_pred)

    precision_train = precision_score(y_train, y_train_pred)
    precision_val = precision_score(y_val, y_val_pred)

    recall_train = recall_score(y_train, y_train_pred)
    recall_val = recall_score(y_val, y_val_pred)

    f1_train = f1_score(y_train, y_train_pred)
    f1_val = f1_score(y_val, y_val_pred)

    roc_auc_train = roc_auc_score(y_train, y_train_pred)
    roc_auc_val = roc_auc_score(y_val, y_val_pred)

    # Return all 9 values
    return [
        accuracy_train, accuracy_val,
        precision_train, precision_val,
        recall_train, recall_val,
        f1_train, f1_val,
        roc_auc_train, roc_auc_val
    ]
In [ ]:
def add_score_model(model_name, score, cv_res):
    model_names.append(model_name)

    # Check length before appending
    if len(score) < 10:
        print(f"Warning: `score` has only {len(score)} elements!")

    accuracy_train.append(score[0])
    accuracy_test.append(score[1])
    precision_train.append(score[2])
    precision_test.append(score[3])
    recall_train.append(score[4])
    recall_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])

    # Only append if enough values exist
    if len(score) > 8:
        roc_auc_train.append(score[8])
        roc_auc_test.append(score[9])

    cross_val_train.append(cv_res)
In [ ]:
print("Length of model_names:", len(model_names))
print("Length of cross_val_train:", len(cross_val_train))
print("Length of acc_train:", len(acc_train))
print("Length of acc_test:", len(acc_test))
print("Length of recall_train:", len(recall_train))
print("Length of recall_test:", len(recall_test))
print("Length of precision_train:", len(precision_train))
print("Length of precision_test:", len(precision_test))
print("Length of f1_train:", len(f1_train))
print("Length of f1_test:", len(f1_test))
print("Length of roc_auc_train:", len(roc_auc_train))
print("Length of roc_auc_test:", len(roc_auc_test))
Length of model_names: 2
Length of cross_val_train: 0
Length of acc_train: 2
Length of acc_test: 2
Length of recall_train: 2
Length of recall_test: 2
Length of precision_train: 2
Length of precision_test: 2
Length of f1_train: 2
Length of f1_test: 2
Length of roc_auc_train: 0
Length of roc_auc_test: 0
In [ ]:
max_length = len(model_names)  # Use model_names as the reference length

# Ensure all lists have the same length
lists_to_fix = [
    cross_val_train, acc_train, acc_test, recall_train, recall_test,
    precision_train, precision_test, f1_train, f1_test, roc_auc_train, roc_auc_test
]

for lst in lists_to_fix:
    while len(lst) < max_length:
        lst.append(None)  # or 0.0 if you prefer numerical values

# Now, re-run the DataFrame creation
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)
In [ ]:
for name, model in models:
    print(f"Processing model: {name}")  # Debugging line

    scoring = "recall"
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    try:
        cv_result = cross_val_score(model, X_train, y_train, scoring=scoring, cv=kfold)
        cv_results.append(cv_result)

        model.fit(X_train, y_train)
        model_score = get_metrics_score(model, X_train, X_val, y_train, y_val)
        add_score_model(name, model_score, cv_result.mean())
    except Exception as e:
        print(f"⚠️ Error in model {name}: {e}")  # Catch any errors
Processing model: Bagging
⚠️ Error in model Bagging: name 'accuracy_train' is not defined
Processing model: Random forest
⚠️ Error in model Random forest: name 'accuracy_train' is not defined
Processing model: GBM
⚠️ Error in model GBM: name 'accuracy_train' is not defined
Processing model: Adaboost
⚠️ Error in model Adaboost: name 'accuracy_train' is not defined
Processing model: Xgboost
⚠️ Error in model Xgboost: name 'accuracy_train' is not defined
Processing model: dtree
⚠️ Error in model dtree: name 'accuracy_train' is not defined
Processing model: Light GBM
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000796 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000301 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000288 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000257 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000255 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000256 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000249 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000277 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 878, number of negative: 4590
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000282 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160571 -> initscore=-1.653989
[LightGBM] [Info] Start training from score -1.653989
[LightGBM] [Info] Number of positive: 976, number of negative: 5099
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000305 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 6075, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160658 -> initscore=-1.653337
[LightGBM] [Info] Start training from score -1.653337
⚠️ Error in model Light GBM: name 'accuracy_train' is not defined

Comparing Models

In [ ]:
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []
model_names = []
In [ ]:
def add_score_model(model_name, score, cv_res):
    global acc_train, acc_test, recall_train, recall_test
    global precision_train, precision_test, f1_train, f1_test
    global roc_auc_train, roc_auc_test, cross_val_train, model_names

    # Debugging print statement
    print(f"Adding model: {model_name}")

    model_names.append(model_name)
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])

    # Ensure index 8 & 9 exist in score
    if len(score) > 8:
        roc_auc_train.append(score[8])
        roc_auc_test.append(score[9])
    else:
        roc_auc_train.append(None)
        roc_auc_test.append(None)

    cross_val_train.append(cv_res)
In [ ]:
print(f"Model: {name}, Score Output: {model_score}")
Model: Light GBM, Score Output: [0.9991769547325103, 0.9703849950641659, 0.9959141981613892, 0.9182389937106918, 0.9989754098360656, 0.8957055214723927, 0.9974424552429667, 0.906832298136646, 0.9990954711467052, 0.940205701912667]
In [ ]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)
In [ ]:
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Out[ ]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
In [ ]:
def add_score_model(model_name, score, cv_res):
    print(f"Model: {model_name}, Score: {score}, CV Result: {cv_res}")  # Debugging print

    model_names.append(model_name)
    cross_val_train.append(cv_res)
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])
    roc_auc_train.append(score[8])
    roc_auc_test.append(score[9])
In [ ]:
for name, model in models:
    print(f"Processing model: {name}")  # Debugging step

    scoring = "recall"
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    try:
        cv_result = cross_val_score(model, X_train, y_train, scoring=scoring, cv=kfold)
        cv_results.append(cv_result)

        model.fit(X_train, y_train)
        model_score = get_metrics_score(model, X_train, X_val, y_train, y_val)

        print(f"Calling add_score_model() for: {name}")  # Debugging print
        add_score_model(name, model_score, cv_result.mean())  # Check if this runs

    except Exception as e:
        print(f"⚠️ Error in model {name}: {e}")

print("Operation Completed!")
Processing model: Bagging
Calling add_score_model() for: Bagging
Model: Bagging, Score: [0.9963786008230453, 0.9580454096742349, 0.9958419958419958, 0.8950819672131147, 0.9815573770491803, 0.8374233128834356, 0.9886480908152735, 0.8652931854199684, 0.9903864547532625, 0.9092998917358355], CV Result: 0.7878813381022512
Processing model: Random forest
Calling add_score_model() for: Random forest
Model: Random forest, Score: [1.0, 0.9674234945705824, 1.0, 0.9421768707482994, 1.0, 0.8496932515337423, 1.0, 0.8935483870967742, 1.0, 0.9198466257668712], CV Result: 0.8002209131075111
Processing model: GBM
Calling add_score_model() for: GBM
Model: GBM, Score: [0.968724279835391, 0.9693978282329714, 0.9337748344370861, 0.9342105263157895, 0.8668032786885246, 0.8711656441717791, 0.8990435706695006, 0.9015873015873016, 0.9275181327743468, 0.929700469144713], CV Result: 0.8032610982537344
Processing model: Adaboost
Calling add_score_model() for: Adaboost
Model: Adaboost, Score: [0.9425514403292181, 0.9550839091806516, 0.8718861209964412, 0.9065743944636678, 0.7530737704918032, 0.803680981595092, 0.8081363386476086, 0.8520325203252033, 0.8659465734200534, 0.8938993143269578], CV Result: 0.7295602777193351
Processing model: Xgboost
Calling add_score_model() for: Xgboost
Model: Xgboost, Score: [1.0, 0.9698914116485686, 1.0, 0.9206349206349206, 1.0, 0.8895705521472392, 1.0, 0.9048361934477379, 1.0, 0.937432334897149], CV Result: 0.8452661476961918
Processing model: dtree
Calling add_score_model() for: dtree
Model: dtree, Score: [1.0, 0.9353405725567621, 1.0, 0.7927927927927928, 1.0, 0.8098159509202454, 1.0, 0.8012139605462822, 1.0, 0.8846138578130639], CV Result: 0.7643383126446455
Processing model: Light GBM
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000263 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000251 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000294 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000281 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000262 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000270 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000269 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000286 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000320 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 878, number of negative: 4590
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160571 -> initscore=-1.653989
[LightGBM] [Info] Start training from score -1.653989
[LightGBM] [Info] Number of positive: 976, number of negative: 5099
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000315 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 6075, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160658 -> initscore=-1.653337
[LightGBM] [Info] Start training from score -1.653337
Calling add_score_model() for: Light GBM
Model: Light GBM, Score: [0.9991769547325103, 0.9703849950641659, 0.9959141981613892, 0.9182389937106918, 0.9989754098360656, 0.8957055214723927, 0.9974424552429667, 0.906832298136646, 0.9990954711467052, 0.940205701912667], CV Result: 0.8401535872080791
Operation Completed!
In [ ]:
print(f"Metrics for {name}: {model_score}")
Metrics for Light GBM: [0.9991769547325103, 0.9703849950641659, 0.9959141981613892, 0.9182389937106918, 0.9989754098360656, 0.8957055214723927, 0.9974424552429667, 0.906832298136646, 0.9990954711467052, 0.940205701912667]
In [ ]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Out[ ]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
4 Xgboost 0.845266 1.000000 0.969891 1.000000 0.920635 1.000000 0.889571 1.000000 0.904836 1.000000 0.937432
6 Light GBM 0.840154 0.999177 0.970385 0.995914 0.918239 0.998975 0.895706 0.997442 0.906832 0.999095 0.940206
2 GBM 0.803261 0.968724 0.969398 0.933775 0.934211 0.866803 0.871166 0.899044 0.901587 0.927518 0.929700
1 Random forest 0.800221 1.000000 0.967423 1.000000 0.942177 1.000000 0.849693 1.000000 0.893548 1.000000 0.919847
0 Bagging 0.787881 0.996379 0.958045 0.995842 0.895082 0.981557 0.837423 0.988648 0.865293 0.990386 0.909300
5 dtree 0.764338 1.000000 0.935341 1.000000 0.792793 1.000000 0.809816 1.000000 0.801214 1.000000 0.884614
3 Adaboost 0.729560 0.942551 0.955084 0.871886 0.906574 0.753074 0.803681 0.808136 0.852033 0.865947 0.893899

Observation:

It is observed that The best model with respect to cross validation score and test recall is Light GBM The next best models are XGBoost, GBM and AdaBoost respectively

In [ ]:
print(f"Number of models: {len(model_names)}")
print(f"Number of CV results: {len(cv_results)}")
for i, scores in enumerate(cv_results):
    print(f"Model {i}: {len(scores)} scores -> {scores}")
Number of models: 7
Number of CV results: 15
Model 0: 10 scores -> [0.76530612 0.74489796 0.76530612 0.8877551  0.80612245 0.77319588
 0.80412371 0.7628866  0.78350515 0.78571429]
Model 1: 10 scores -> [0.76530612 0.74489796 0.76530612 0.8877551  0.80612245 0.77319588
 0.80412371 0.7628866  0.78350515 0.78571429]
Model 2: 10 scores -> [0.76530612 0.79591837 0.74489796 0.82653061 0.81632653 0.77319588
 0.83505155 0.79381443 0.81443299 0.83673469]
Model 3: 10 scores -> [0.76530612 0.80612245 0.76530612 0.82653061 0.81632653 0.79381443
 0.81443299 0.7628866  0.82474227 0.85714286]
Model 4: 10 scores -> [0.69387755 0.7244898  0.67346939 0.70408163 0.69387755 0.7628866
 0.77319588 0.70103093 0.73195876 0.83673469]
Model 5: 10 scores -> [0.82653061 0.81632653 0.83673469 0.87755102 0.84693878 0.81443299
 0.8556701  0.83505155 0.8556701  0.8877551 ]
Model 6: 10 scores -> [0.73469388 0.73469388 0.70408163 0.81632653 0.83673469 0.73195876
 0.81443299 0.72164948 0.78350515 0.76530612]
Model 7: 10 scores -> [0.80612245 0.83673469 0.80612245 0.85714286 0.86734694 0.82474227
 0.86597938 0.79381443 0.86597938 0.87755102]
Model 8: 10 scores -> [0.76530612 0.74489796 0.76530612 0.8877551  0.80612245 0.77319588
 0.80412371 0.7628866  0.78350515 0.78571429]
Model 9: 10 scores -> [0.76530612 0.79591837 0.74489796 0.82653061 0.81632653 0.77319588
 0.83505155 0.79381443 0.81443299 0.83673469]
Model 10: 10 scores -> [0.76530612 0.80612245 0.76530612 0.82653061 0.81632653 0.79381443
 0.81443299 0.7628866  0.82474227 0.85714286]
Model 11: 10 scores -> [0.69387755 0.7244898  0.67346939 0.70408163 0.69387755 0.7628866
 0.77319588 0.70103093 0.73195876 0.83673469]
Model 12: 10 scores -> [0.82653061 0.81632653 0.83673469 0.87755102 0.84693878 0.81443299
 0.8556701  0.83505155 0.8556701  0.8877551 ]
Model 13: 10 scores -> [0.73469388 0.73469388 0.70408163 0.81632653 0.83673469 0.73195876
 0.81443299 0.72164948 0.78350515 0.76530612]
Model 14: 10 scores -> [0.80612245 0.83673469 0.80612245 0.85714286 0.86734694 0.82474227
 0.86597938 0.79381443 0.86597938 0.87755102]
In [ ]:
import matplotlib.pyplot as plt

# Validate cv_results and model_names lengths
if len(cv_results) != len(model_names):
    print("⚠️ Error: Mismatch in model count and CV results length!")
    print(f"Number of models: {len(model_names)}")
    print(f"Number of CV results: {len(cv_results)}")
else:
    print("✅ Data structure looks correct!")

# Plot the boxplot only if the lengths match
if len(cv_results) == len(model_names):
    fig = plt.figure(figsize=(10, 7))
    fig.suptitle("Algorithm Comparison")
    ax = fig.add_subplot(111)

    plt.boxplot(cv_results)  # Use cv_results directly if it has the correct structure
    ax.set_xticklabels(model_names, rotation=45, ha="right")  # Rotate for readability

    plt.ylabel("Cross-Validation Score")
    plt.xlabel("Models")
    plt.show()
⚠️ Error: Mismatch in model count and CV results length!
Number of models: 7
Number of CV results: 15
In [ ]:
cv_result = cross_val_score(model, X_train, y_train, scoring=scoring, cv=kfold)
cv_results.append(cv_result)
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000437 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000403 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000257 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000246 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000264 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000273 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000276 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 878, number of negative: 4590
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160571 -> initscore=-1.653989
[LightGBM] [Info] Start training from score -1.653989
In [ ]:
cv_results.append(list(cv_result))  # Ensure it's stored as a list of lists
In [ ]:
cv_results = []  # Reset the list

for name, model in models:
    print(f"Processing model: {name}")  # Debugging line

    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    try:
        cv_result = cross_val_score(model, X_train, y_train, scoring="recall", cv=kfold)
        cv_results.append(list(cv_result))  # Ensure it's a list of lists
    except Exception as e:
        print(f"⚠️ Error in model {name}: {e}")

print(f"Final CV results length: {len(cv_results)}")  # Debugging
Processing model: Bagging
Processing model: Random forest
Processing model: GBM
Processing model: Adaboost
Processing model: Xgboost
Processing model: dtree
Processing model: Light GBM
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000414 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000263 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000262 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000290 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 878, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000259 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5467, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160600 -> initscore=-1.653771
[LightGBM] [Info] Start training from score -1.653771
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000254 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000252 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000265 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1181
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 879, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000254 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1179
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160753 -> initscore=-1.652633
[LightGBM] [Info] Start training from score -1.652633
[LightGBM] [Info] Number of positive: 878, number of negative: 4590
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000268 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1180
[LightGBM] [Info] Number of data points in the train set: 5468, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.160571 -> initscore=-1.653989
[LightGBM] [Info] Start training from score -1.653989
Final CV results length: 7
In [ ]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(cv_results)
ax.set_xticklabels(model_names)

plt.show()

Observation:

It is observed that It appears Light GBM, XGBoost, GBM are the models with good potential. Ada Boost also looks good with the higher end outlier performance score

Oversampling train data using SMOTE¶ Our dataset has a huge imbalance in target variable labels. To deal with such datasets, we have a few tricks up our sleeves, which we call Imbalanced Classification.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

In [ ]:
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy="minority", k_neighbors=10, random_state=seed
)  # Synthetic Minority Over Sampling Technique

X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 976
Before UpSampling, counts of label 'No': 5099 

After UpSampling, counts of label 'Yes': 5099
After UpSampling, counts of label 'No': 5099 

After UpSampling, the shape of train_X: (10198, 12)
After UpSampling, the shape of train_y: (10198,) 

In [ ]:
for name, model in models:
    print(f"Processing model: {name}")  # Debugging line

    scoring = "recall"
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    try:
        cv_result_over = cross_val_score(
            estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
        )  # No extra spaces before this line

        cv_results.append(cv_result_over)  # Append the result

        model.fit(X_train_over, y_train_over)  # Fit the model

        model_score_over = get_metrics_score(
            model, X_train_over, X_val, y_train_over, y_val
        )  # Ensure alignment

        add_score_model(name, model_score_over, cv_result_over.mean())  # Update scores

    except Exception as e:
        print(f"⚠️ Error in model {name}: {e}")  # Catch errors gracefully

print("Operation Completed!")
Processing model: Bagging
Model: Bagging, Score: [0.9966660129437145, 0.9452122408687068, 0.9974464741701041, 0.8062678062678063, 0.995881545401059, 0.8680981595092024, 0.9966633954857703, 0.8360413589364845, 0.9966660129437145, 0.9140490797546013], CV Result: 0.9539119380561655
Processing model: Random forest
Model: Random forest, Score: [1.0, 0.9605133267522211, 1.0, 0.847457627118644, 1.0, 0.9202453987730062, 1.0, 0.8823529411764706, 1.0, 0.9442403464453266], CV Result: 0.9760722678069262
Processing model: GBM
Model: GBM, Score: [0.968719356736615, 0.9575518262586377, 0.9638975155279503, 0.8333333333333334, 0.9739164542067072, 0.9202453987730062, 0.9688810847722173, 0.8746355685131195, 0.9687193567366151, 0.9424756405629737], CV Result: 0.9637177857390501
Processing model: Adaboost
Model: Adaboost, Score: [0.9366542459305747, 0.9136229022704837, 0.9282554337372572, 0.6751740139211136, 0.9464600902137674, 0.8926380368098159, 0.9372693726937269, 0.7688243064729194, 0.9366542459305746, 0.9051425478166726], CV Result: 0.945676258715667
Processing model: Xgboost
Model: Xgboost, Score: [0.9996077662286723, 0.9679170779861797, 0.9992161473642955, 0.8942598187311178, 1.0, 0.9079754601226994, 0.9996079200156832, 0.9010654490106544, 0.9996077662286723, 0.9436936124142908], CV Result: 0.9766612735467468
Processing model: dtree
Model: dtree, Score: [1.0, 0.9230009871668312, 1.0, 0.75, 1.0, 0.7822085889570553, 1.0, 0.7657657657657657, 1.0, 0.8661042944785277], CV Result: 0.9354744019415232
Processing model: Light GBM
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000453 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000402 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000449 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000439 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000466 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000433 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000700 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4589, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000735 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9178, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4590, number of negative: 4589
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000694 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9179, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500054 -> initscore=0.000218
[LightGBM] [Info] Start training from score 0.000218
[LightGBM] [Info] Number of positive: 4589, number of negative: 4590
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000720 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 9179, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499946 -> initscore=-0.000218
[LightGBM] [Info] Start training from score -0.000218
[LightGBM] [Info] Number of positive: 5099, number of negative: 5099
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000791 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2059
[LightGBM] [Info] Number of data points in the train set: 10198, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Model: Light GBM, Score: [0.9965679545008825, 0.9679170779861797, 0.9958871915393654, 0.887240356083086, 0.9972543636007061, 0.9171779141104295, 0.9965703086722195, 0.9019607843137255, 0.9965679545008825, 0.9474124864669795], CV Result: 0.9748957972186911
Operation Completed!
In [ ]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Out[ ]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
1 Random forest 0.800221 1.000000 0.967423 1.000000 0.942177 1.000000 0.849693 1.000000 0.893548 1.000000 0.919847
2 GBM 0.803261 0.968724 0.969398 0.933775 0.934211 0.866803 0.871166 0.899044 0.901587 0.927518 0.929700
4 Xgboost 0.845266 1.000000 0.969891 1.000000 0.920635 1.000000 0.889571 1.000000 0.904836 1.000000 0.937432
6 Light GBM 0.840154 0.999177 0.970385 0.995914 0.918239 0.998975 0.895706 0.997442 0.906832 0.999095 0.940206
3 Adaboost 0.729560 0.942551 0.955084 0.871886 0.906574 0.753074 0.803681 0.808136 0.852033 0.865947 0.893899
0 Bagging 0.787881 0.996379 0.958045 0.995842 0.895082 0.981557 0.837423 0.988648 0.865293 0.990386 0.909300
11 Xgboost 0.976661 0.999608 0.967917 0.999216 0.894260 1.000000 0.907975 0.999608 0.901065 0.999608 0.943694
13 Light GBM 0.974896 0.996568 0.967917 0.995887 0.887240 0.997254 0.917178 0.996570 0.901961 0.996568 0.947412
8 Random forest 0.976072 1.000000 0.960513 1.000000 0.847458 1.000000 0.920245 1.000000 0.882353 1.000000 0.944240
9 GBM 0.963718 0.968719 0.957552 0.963898 0.833333 0.973916 0.920245 0.968881 0.874636 0.968719 0.942476
7 Bagging 0.953912 0.996666 0.945212 0.997446 0.806268 0.995882 0.868098 0.996663 0.836041 0.996666 0.914049
5 dtree 0.764338 1.000000 0.935341 1.000000 0.792793 1.000000 0.809816 1.000000 0.801214 1.000000 0.884614
12 dtree 0.935474 1.000000 0.923001 1.000000 0.750000 1.000000 0.782209 1.000000 0.765766 1.000000 0.866104
10 Adaboost 0.945676 0.936654 0.913623 0.928255 0.675174 0.946460 0.892638 0.937269 0.768824 0.936654 0.905143

Observation:

It is observed that The best 4 models with respect to validation recall and cross validation score, are as follows: Light GBM trained with over/up-sampled data GBM trained with over/up-sampled data AdaBoost trained with over/up-sampled data XGBoost trained with over/up-sampled data

Undersampling train data using Random Under Sampler Undersampling is another way of dealing with imbalance in the dataset.

Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset until a balanced dataset is created.

In [ ]:
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [ ]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099 

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976 

After Under Sampling, the shape of train_X: (1952, 12)
After Under Sampling, the shape of train_y: (1952,) 

Build Models with Undersampled Data Build and Train Models

In [ ]:
models_under = []

# Appending models into the list

models_under.append(("Bagging DownSampling", BaggingClassifier(random_state=seed)))
models_under.append(
    ("Random forest DownSampling", RandomForestClassifier(random_state=seed))
)
models_under.append(("GBM DownSampling", GradientBoostingClassifier(random_state=seed)))
models_under.append(("Adaboost DownSampling", AdaBoostClassifier(random_state=seed)))
models_under.append(
    ("Xgboost DownSampling", XGBClassifier(random_state=seed, eval_metric=loss_func))
)
models_under.append(("dtree DownSampling", DecisionTreeClassifier(random_state=seed)))
models_under.append(("Light GBM DownSampling", lgb.LGBMClassifier(random_state=seed)))
for name, model in models_under:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=1
    )  # Setting number of splits equal to 10

    cv_result_under = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    cv_results.append(cv_result_under)

    model.fit(X_train_un, y_train_un)
    model_score_under = get_metrics_score(model, X_train_un, X_val, y_train_un, y_val)
    add_score_model(name, model_score_under, cv_result_under.mean())

print("Operation Completed!")
Model: Bagging DownSampling, Score: [0.9933401639344263, 0.9244817374136229, 0.9969040247678018, 0.7035294117647058, 0.9897540983606558, 0.9171779141104295, 0.9933161953727506, 0.796271637816245, 0.9933401639344261, 0.9215301335258029], CV Result: 0.9107721439091101
Model: Random forest DownSampling, Score: [1.0, 0.9318854886475815, 1.0, 0.7175925925925926, 1.0, 0.950920245398773, 1.0, 0.8179419525065963, 1.0, 0.93957776975821], CV Result: 0.9497264885335577
Model: GBM DownSampling, Score: [0.9697745901639344, 0.9378084896347483, 0.9608040201005025, 0.7347417840375586, 0.9795081967213115, 0.9601226993865031, 0.9700659563673262, 0.8324468085106383, 0.9697745901639344, 0.9468260555756045], CV Result: 0.9507889753839681
Model: Adaboost DownSampling, Score: [0.9287909836065574, 0.918558736426456, 0.9248730964467005, 0.6817155756207675, 0.9334016393442623, 0.9263803680981595, 0.9291177970423253, 0.7854356306892067, 0.9287909836065575, 0.9217195958137855], CV Result: 0.9220913107511048
Model: Xgboost DownSampling, Score: [1.0, 0.9387956564659428, 1.0, 0.7370892018779343, 1.0, 0.9631901840490797, 1.0, 0.8351063829787234, 1.0, 0.9486539155539516], CV Result: 0.9436040395539658
Model: dtree DownSampling, Score: [1.0, 0.8904244817374136, 1.0, 0.6065573770491803, 1.0, 0.9079754601226994, 1.0, 0.7272727272727273, 1.0, 0.8975171418260557], CV Result: 0.884146854618136
[LightGBM] [Info] Number of positive: 878, number of negative: 878
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000128 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1756, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 878, number of negative: 878
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000099 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1756, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000419 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1163
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000343 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000101 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000098 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1163
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000447 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1165
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000328 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1163
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000130 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000350 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Info] Number of positive: 976, number of negative: 976
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000146 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1165
[LightGBM] [Info] Number of data points in the train set: 1952, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Model: Light GBM DownSampling, Score: [1.0, 0.9368213228035538, 1.0, 0.7323943661971831, 1.0, 0.9570552147239264, 1.0, 0.8297872340425532, 1.0, 0.9449981955972574], CV Result: 0.9476751525352409
Operation Completed!
In [ ]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Out[ ]:
  Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
1 Random forest 0.800221 1.000000 0.967423 1.000000 0.942177 1.000000 0.849693 1.000000 0.893548 1.000000 0.919847
2 GBM 0.803261 0.968724 0.969398 0.933775 0.934211 0.866803 0.871166 0.899044 0.901587 0.927518 0.929700
4 Xgboost 0.845266 1.000000 0.969891 1.000000 0.920635 1.000000 0.889571 1.000000 0.904836 1.000000 0.937432
6 Light GBM 0.840154 0.999177 0.970385 0.995914 0.918239 0.998975 0.895706 0.997442 0.906832 0.999095 0.940206
3 Adaboost 0.729560 0.942551 0.955084 0.871886 0.906574 0.753074 0.803681 0.808136 0.852033 0.865947 0.893899
0 Bagging 0.787881 0.996379 0.958045 0.995842 0.895082 0.981557 0.837423 0.988648 0.865293 0.990386 0.909300
11 Xgboost 0.976661 0.999608 0.967917 0.999216 0.894260 1.000000 0.907975 0.999608 0.901065 0.999608 0.943694
13 Light GBM 0.974896 0.996568 0.967917 0.995887 0.887240 0.997254 0.917178 0.996570 0.901961 0.996568 0.947412
8 Random forest 0.976072 1.000000 0.960513 1.000000 0.847458 1.000000 0.920245 1.000000 0.882353 1.000000 0.944240
9 GBM 0.963718 0.968719 0.957552 0.963898 0.833333 0.973916 0.920245 0.968881 0.874636 0.968719 0.942476
7 Bagging 0.953912 0.996666 0.945212 0.997446 0.806268 0.995882 0.868098 0.996663 0.836041 0.996666 0.914049
5 dtree 0.764338 1.000000 0.935341 1.000000 0.792793 1.000000 0.809816 1.000000 0.801214 1.000000 0.884614
12 dtree 0.935474 1.000000 0.923001 1.000000 0.750000 1.000000 0.782209 1.000000 0.765766 1.000000 0.866104
18 Xgboost DownSampling 0.943604 1.000000 0.938796 1.000000 0.737089 1.000000 0.963190 1.000000 0.835106 1.000000 0.948654
16 GBM DownSampling 0.950789 0.969775 0.937808 0.960804 0.734742 0.979508 0.960123 0.970066 0.832447 0.969775 0.946826
20 Light GBM DownSampling 0.947675 1.000000 0.936821 1.000000 0.732394 1.000000 0.957055 1.000000 0.829787 1.000000 0.944998
15 Random forest DownSampling 0.949726 1.000000 0.931885 1.000000 0.717593 1.000000 0.950920 1.000000 0.817942 1.000000 0.939578
14 Bagging DownSampling 0.910772 0.993340 0.924482 0.996904 0.703529 0.989754 0.917178 0.993316 0.796272 0.993340 0.921530
17 Adaboost DownSampling 0.922091 0.928791 0.918559 0.924873 0.681716 0.933402 0.926380 0.929118 0.785436 0.928791 0.921720
10 Adaboost 0.945676 0.936654 0.913623 0.928255 0.675174 0.946460 0.892638 0.937269 0.768824 0.936654 0.905143
19 dtree DownSampling 0.884147 1.000000 0.890424 1.000000 0.606557 1.000000 0.907975 1.000000 0.727273 1.000000 0.897517

Observation:

The 4 best models are: XGBoost trained with undersampled data AdaBoost trained with undersampled data Light GBM trained with undersampled data GBM trained with undersampled data

In [ ]:
%%time

# defining model
model = XGBClassifier(random_state=seed, eval_metric=loss_func)


# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,500,50),
            'scale_pos_weight':[2,5,10],
            'learning_rate':[0.01,0.1,0.2,0.05],
            'gamma':[0,1,3,5],
            'subsample':[0.8,0.9,1],
            'max_depth':np.arange(4,20,1),
            'reg_lambda':[5,10, 15, 20]}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
xgb_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
xgb_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(xgb_tuned.best_params_,xgb_tuned.best_score_))
Best parameters are {'subsample': 1, 'scale_pos_weight': 10, 'reg_lambda': 10, 'n_estimators': 50, 'max_depth': 11, 'learning_rate': 0.01, 'gamma': 3} with CV score=1.0:
CPU times: user 3.53 s, sys: 464 ms, total: 3.99 s
Wall time: 2min 5s
In [ ]:
# building model with best parameters
xgb_tuned_model = XGBClassifier(
    n_estimators=150,
    scale_pos_weight=10,
    subsample=1,
    reg_lambda=20,
    max_depth=5,
    learning_rate=0.01,
    gamma=0,
    eval_metric=loss_func,
    random_state=seed,
)
# Fit the model on training data
xgb_tuned_model.fit(X_train_un, y_train_un)
Out[ ]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=0, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=5,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=150,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [ ]:
xgb_tuned_model_score = get_metrics_score(
    xgb_tuned_model, X_train, X_val, y_train, y_val
)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
xgb_down_cv = cross_val_score(
    estimator=xgb_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "XGB Tuned with Down Sampling", xgb_tuned_model_score, xgb_down_cv.mean()
)
Model: XGB Tuned with Down Sampling, Score: [0.6758847736625514, 0.6855873642645607, 0.33140916808149407, 0.33852544132917967, 1.0, 1.0, 0.4978321856669217, 0.5058184639255237, 0.8069229260639341, 0.8126470588235294], CV Result: 0.9969072164948454
In [ ]:
make_confusion_matrix(xgb_tuned_model, X_val, y_val)
In [ ]:
%%time

# defining model
model = AdaBoostClassifier(random_state=seed)



# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,2000,50),
            'learning_rate':[0.01,0.1,0.2,0.05]}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
ada_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
ada_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(ada_tuned.best_params_,ada_tuned.best_score_))
Best parameters are {'n_estimators': 1600, 'learning_rate': 0.2} with CV score=0.9405533347359564:
CPU times: user 25.6 s, sys: 3.77 s, total: 29.4 s
Wall time: 28min 6s
In [ ]:
# building model with best parameters
ada_tuned_model = AdaBoostClassifier(
    n_estimators=1050, learning_rate=0.1, random_state=seed
)
# Fit the model on training data
ada_tuned_model.fit(X_train_un, y_train_un)
Out[ ]:
AdaBoostClassifier(learning_rate=0.1, n_estimators=1050, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [ ]:
ada_tuned_model_score = get_metrics_score(
    ada_tuned_model, X_train, X_val, y_train, y_val
)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
ada_down_cv = cross_val_score(
    estimator=ada_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "AdaBoost Tuned with Down Sampling", ada_tuned_model_score, ada_down_cv.mean()
)
Model: AdaBoost Tuned with Down Sampling, Score: [0.9132510288065844, 0.9180651530108588, 0.6604717655468192, 0.6777777777777778, 0.9467213114754098, 0.9355828220858896, 0.7781052631578947, 0.7860824742268041, 0.9267828953925392, 0.9251443522194154], CV Result: 0.9374395118872292

Confusion matrix on validation

In [ ]:
make_confusion_matrix(ada_tuned_model, X_val, y_val)

Tuning Light GBM with Down-Sampled data

In [ ]:
%%time

# defining model
model = lgb.LGBMClassifier(random_state=seed)

# Hyper parameters
min_gain_to_split = [0.01, 0.1, 0.2, 0.3]
min_data_in_leaf = [10, 20, 30, 40, 50]
feature_fraction = [0.8, 0.9, 1.0]
max_depth = [5, 8, 15, 25, 30]
extra_trees = [True, False]
learning_rate = [0.01,0.1,0.2,0.05]

# Parameter grid to pass in RandomizedSearchCV
param_grid={'min_gain_to_split': min_gain_to_split,
               'min_data_in_leaf': min_data_in_leaf,
               'feature_fraction': feature_fraction,
               'max_depth': max_depth,
               'extra_trees': extra_trees,
               'learning_rate': learning_rate,
               'boosting_type': ['gbdt'],
               'objective': ['binary'],
               'is_unbalance': [True],
               'metric': ['binary_logloss'],}
               # Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
lgbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
lgbm_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(lgbm_tuned.best_params_,lgbm_tuned.best_score_))
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 976, number of negative: 976
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000338 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1165
[LightGBM] [Info] Number of data points in the train set: 1952, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Best parameters are {'objective': 'binary', 'min_gain_to_split': 0.01, 'min_data_in_leaf': 40, 'metric': 'binary_logloss', 'max_depth': 15, 'learning_rate': 0.05, 'is_unbalance': True, 'feature_fraction': 0.8, 'extra_trees': False, 'boosting_type': 'gbdt'} with CV score=0.953892278560909:
CPU times: user 2.38 s, sys: 267 ms, total: 2.65 s
Wall time: 55.8 s

Building the model with the resulted best parameters¶

In [ ]:
lgbm_tuned_model = lgb.LGBMClassifier(
               min_gain_to_split = 0.01,
               min_data_in_leaf = 50,
               feature_fraction = 0.8,
               max_depth = 8,
               extra_trees = False,
               learning_rate = 0.2,
               objective = 'binary',
               metric = 'binary_logloss',
               is_unbalance = True,
               boosting_type = 'gbdt',
               random_state = seed
)
# Fit the model on training data
lgbm_tuned_model.fit(X_train_un, y_train_un)
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 976, number of negative: 976
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000314 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1165
[LightGBM] [Info] Number of data points in the train set: 1952, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Out[ ]:
LGBMClassifier(extra_trees=False, feature_fraction=0.8, is_unbalance=True,
               learning_rate=0.2, max_depth=8, metric='binary_logloss',
               min_data_in_leaf=50, min_gain_to_split=0.01, objective='binary',
               random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [ ]:
lgbm_tuned_model_score = get_metrics_score(
    lgbm_tuned_model, X_train, X_val, y_train, y_val
)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
lgb_down_cv = cross_val_score(
    estimator=lgbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "Light GBM Tuned with Down Sampling", lgbm_tuned_model_score, lgb_down_cv.mean()
)
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 878, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000422 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1756, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 878, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000487 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1756, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000461 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1163
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000442 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000159 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 879, number of negative: 878
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000432 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1163
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138
[LightGBM] [Info] Start training from score 0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000133 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1165
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000450 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1163
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000289 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000122 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1164
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
Model: Light GBM Tuned with Down Sampling, Score: [0.9519341563786008, 0.941263573543929, 0.7697160883280757, 0.7482014388489209, 1.0, 0.9570552147239264, 0.8698752228163993, 0.8398384925975774, 0.971366934693077, 0.9476452544207867], CV Result: 0.949737008205344
In [ ]:
make_confusion_matrix(lgbm_tuned_model, X_val, y_val)
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8

Tuning GBM with Down Sampled data

In [ ]:
%%time

# defining model
model = GradientBoostingClassifier(random_state=seed)

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10, 15]


# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
gbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
gbm_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(gbm_tuned.best_params_,gbm_tuned.best_score_))
/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py:528: FitFailedWarning: 
230 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
230 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of GradientBoostingClassifier must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_search.py:1108: UserWarning: One or more of the test scores are non-finite: [0.94775931 0.94571849 0.94468757 0.94058489 0.94568693        nan
        nan        nan        nan 0.94772775 0.94672838 0.94670734
        nan 0.94875868 0.94158426        nan        nan        nan
        nan 0.93953293        nan 0.94464549        nan        nan
 0.94875868 0.94671786 0.94569745        nan 0.94058489 0.94978961
        nan 0.94568693        nan        nan 0.94465601 0.94364612
 0.94671786 0.95183042 0.94468757        nan        nan        nan
        nan 0.94364612        nan 0.94978961 0.94362508        nan
 0.95387124        nan]
  warnings.warn(
Best parameters are {'n_estimators': 483, 'min_samples_split': 2, 'min_samples_leaf': 15, 'max_features': 'sqrt', 'max_depth': 15} with CV score=0.9538712392173364:
CPU times: user 19.3 s, sys: 2.79 s, total: 22.1 s
Wall time: 20min 54s
In [ ]:
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="sqrt",  # Change "auto" to "sqrt" or another valid option
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)
In [ ]:
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="sqrt",
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)
# Fit the model on training data
gbm_tuned_model.fit(X_train_un, y_train_un)
Out[ ]:
GradientBoostingClassifier(max_depth=25, max_features='sqrt',
                           min_samples_leaf=15, n_estimators=700,
                           random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [ ]:
gbm_tuned_model_score = get_metrics_score(
    gbm_tuned_model, X_train, X_val, y_train, y_val
)


kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
gbm_down_cv = cross_val_score(
    estimator=gbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "GBM Tuned with Down Sampling", gbm_tuned_model_score, gbm_down_cv.mean()
)
Model: GBM Tuned with Down Sampling, Score: [0.9545679012345679, 0.945705824284304, 0.7795527156549521, 0.7621359223300971, 1.0, 0.9631901840490797, 0.8761220825852782, 0.8509485094850948, 0.9729358697783879, 0.9527715626127752], CV Result: 0.952850831054071
In [ ]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)


for col in comparison_frame.select_dtypes(include="float64").columns.tolist():
    comparison_frame[col] = round(comparison_frame[col] * 100, 0).astype(int)


comparison_frame.tail(4).sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
)
Out[ ]:
Model Cross_Val_Score_Train Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
21 XGB Tuned with Down Sampling 100 68 69 33 34 100 100 50 51 81 81
24 GBM Tuned with Down Sampling 95 95 95 78 76 100 96 88 85 97 95
23 Light GBM Tuned with Down Sampling 95 95 94 77 75 100 96 87 84 97 95
22 AdaBoost Tuned with Down Sampling 94 91 92 66 68 95 94 78 79 93 93
In [ ]:
feature_names = X_train.columns
importances = gbm_tuned_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In [ ]:
print(X_train.dtypes)
print(X_train.head())
gender                        int64
education_level               int64
income_category               int64
card_category                 int64
total_relationship_count    float64
months_inactive_12_mon      float64
contacts_count_12_mon       float64
total_revolving_bal         float64
total_amt_chng_q4_q1        float64
total_trans_amt             float64
total_trans_ct              float64
total_ct_chng_q4_q1         float64
dtype: object
      gender  education_level  income_category  card_category  \
800        1                6                0              0   
498        1                6                5              0   
4356       1                3                3              0   
407        1                2                2              3   
8728       1                3                1              3   

      total_relationship_count  months_inactive_12_mon  contacts_count_12_mon  \
800                      3.000                   4.000                  3.000   
498                      3.000                   2.000                  0.000   
4356                     2.500                   1.000                  2.000   
407                      3.000                   2.000                  0.000   
8728                     1.000                   2.000                  3.000   

      total_revolving_bal  total_amt_chng_q4_q1  total_trans_amt  \
800                 1.226                 2.044            0.648   
498                 1.450                 1.697            0.524   
4356                1.926                 3.829            1.661   
407                 0.000                 2.675            0.464   
8728                1.037                 3.307            2.971   

      total_trans_ct  total_ct_chng_q4_q1  
800            1.278                2.249  
498            0.861                2.667  
4356           2.194                3.717  
407            1.083                1.266  
8728           2.333                3.165  
In [ ]:
print(X_train.select_dtypes(include=['object']).head())
Empty DataFrame
Columns: []
Index: [800, 498, 4356, 407, 8728]
In [ ]:
from sklearn.preprocessing import LabelEncoder

# Get the actual categorical column names
categorical_cols = X_train.select_dtypes(include=['category', 'object']).columns.tolist()  # Include both 'category' and 'object' types

# Apply Label Encoding to each categorical column
for col in categorical_cols:
    le = LabelEncoder()  # Create a new LabelEncoder for each column
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
In [ ]:
print(X_train.columns)  # Check available columns
Index(['gender', 'education_level', 'income_category', 'card_category',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1',
       'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
print(X_train.dtypes)
gender                        int64
education_level               int64
income_category               int64
card_category                 int64
total_relationship_count    float64
months_inactive_12_mon      float64
contacts_count_12_mon       float64
total_revolving_bal         float64
total_amt_chng_q4_q1        float64
total_trans_amt             float64
total_trans_ct              float64
total_ct_chng_q4_q1         float64
dtype: object
In [ ]:
final_acc_test = 0.0  # or some computed value
In [ ]:
if "final_acc_test" not in locals():
    final_acc_test = 0.0  # Set a default value
In [ ]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
column_name = "actual_column_name_here"  # Replace with the correct column name

if column_name in X_train.columns:
    X_train[column_name] = le.fit_transform(X_train[column_name])
    X_test[column_name] = le.transform(X_test[column_name])
else:
    print(f"Column '{column_name}' not found in X_train")
Column 'actual_column_name_here' not found in X_train
In [ ]:
print([col for col in X_train.columns if "your_column_name" in col])
[]
In [ ]:
final_recall_train = 0.0
In [ ]:
if "final_recall_train" not in locals():
    final_recall_train = 0.0
In [ ]:
final_recall_test = 0.0
In [ ]:
if "final_recall_test" not in locals():
    final_recall_test = 0.0
In [ ]:
final_precision_train =0.0
In [ ]:
if "final_precision_train " not in locals():
    final_precision_train  = 0.0
In [ ]:
final_precision_test =0.0
In [ ]:
if "final_precision_test " not in locals():
    final_precision_test  = 0.0
In [ ]:
final_f1_train =0.0
In [ ]:
if "final_f1_train " not in locals():
    final_f1_train = 0.0
In [ ]:
final_f1_test =0.0
In [ ]:
if "final_f1_test " not in locals():
    final_f1_test = 0.0
In [ ]:
final_roc_auc_train =0.0
In [ ]:
if "final_roc_auc_train " not in locals():
    final_roc_auc_train = 0.0
In [ ]:
final_roc_auc_test =0.0
In [ ]:
if "final_roc_auc_test " not in locals():
    final_roc_auc_test = 0.0
In [ ]:
print(X_train.info())  # Check data types
print(X_train.head())  # Inspect first few rows
print(X_train.select_dtypes(include=['object']).head())  # Show non-numeric columns
<class 'pandas.core.frame.DataFrame'>
Index: 6075 entries, 800 to 4035
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   gender                    6075 non-null   int64  
 1   education_level           6075 non-null   int64  
 2   income_category           6075 non-null   int64  
 3   card_category             6075 non-null   int64  
 4   total_relationship_count  6075 non-null   float64
 5   months_inactive_12_mon    6075 non-null   float64
 6   contacts_count_12_mon     6075 non-null   float64
 7   total_revolving_bal       6075 non-null   float64
 8   total_amt_chng_q4_q1      6075 non-null   float64
 9   total_trans_amt           6075 non-null   float64
 10  total_trans_ct            6075 non-null   float64
 11  total_ct_chng_q4_q1       6075 non-null   float64
dtypes: float64(8), int64(4)
memory usage: 617.0 KB
None
      gender  education_level  income_category  card_category  \
800        1                6                0              0   
498        1                6                5              0   
4356       1                3                3              0   
407        1                2                2              3   
8728       1                3                1              3   

      total_relationship_count  months_inactive_12_mon  contacts_count_12_mon  \
800                      3.000                   4.000                  3.000   
498                      3.000                   2.000                  0.000   
4356                     2.500                   1.000                  2.000   
407                      3.000                   2.000                  0.000   
8728                     1.000                   2.000                  3.000   

      total_revolving_bal  total_amt_chng_q4_q1  total_trans_amt  \
800                 1.226                 2.044            0.648   
498                 1.450                 1.697            0.524   
4356                1.926                 3.829            1.661   
407                 0.000                 2.675            0.464   
8728                1.037                 3.307            2.971   

      total_trans_ct  total_ct_chng_q4_q1  
800            1.278                2.249  
498            0.861                2.667  
4356           2.194                3.717  
407            1.083                1.266  
8728           2.333                3.165  
Empty DataFrame
Columns: []
Index: [800, 498, 4356, 407, 8728]
In [ ]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
In [ ]:
# Ensure both datasets have the same columns
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
In [ ]:
gbm_tuned_model_test_score = get_metrics_score(
    gbm_tuned_model, X_train, X_test, y_train, y_test
)
In [ ]:
gbm_tuned_model_test_score = get_metrics_score(
    gbm_tuned_model, X_train, X_test, y_train, y_test
)

final_model_names = ["gbm Tuned Down-sampled Trained"]
final_acc_train = [gbm_tuned_model_test_score[0]]
final_acc_test = [gbm_tuned_model_test_score[1]]
final_recall_train = [gbm_tuned_model_test_score[2]]
final_recall_test = [gbm_tuned_model_test_score[3]]
final_precision_train = [gbm_tuned_model_test_score[4]]
final_precision_test = [gbm_tuned_model_test_score[5]]
final_f1_train = [gbm_tuned_model_test_score[6]]
final_f1_test = [gbm_tuned_model_test_score[7]]
final_roc_auc_train = [gbm_tuned_model_test_score[8]]
final_roc_auc_test = [gbm_tuned_model_test_score[9]]

final_result_score = pd.DataFrame(
    {
        "Model": final_model_names,
        "Train_Accuracy": final_acc_train,
        "Test_Accuracy": final_acc_test,
        "Train_Recall": final_recall_train,
        "Test_Recall": final_recall_test,
        "Train_Precision": final_precision_train,
        "Test_Precision": final_precision_test,
        "Train_F1": final_f1_train,
        "Test_F1": final_f1_test,
        "Train_ROC_AUC": final_roc_auc_train,
        "Test_ROC_AUC": final_roc_auc_test,
    }
)


for col in final_result_score.select_dtypes(include="float64").columns.tolist():
    final_result_score[col] = final_result_score[col] * 100


final_result_score
Out[ ]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1 Train_ROC_AUC Test_ROC_AUC
0 gbm Tuned Down-sampled Trained 95.457 93.189 77.955 70.917 100.000 97.538 87.612 82.124 97.294 94.948
In [ ]:
make_confusion_matrix(gbm_tuned_model, X_test, y_test)

Giant Chart

In [ ]:
!pip install scipy==1.10 scikit-plot --upgrade
Collecting scipy==1.10
  Downloading scipy-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.9/58.9 kB 4.2 MB/s eta 0:00:00
Requirement already satisfied: scikit-plot in /usr/local/lib/python3.11/dist-packages (0.3.7)
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /usr/local/lib/python3.11/dist-packages (from scipy==1.10) (1.26.4)
Requirement already satisfied: matplotlib>=1.4.0 in /usr/local/lib/python3.11/dist-packages (from scikit-plot) (3.10.0)
Requirement already satisfied: scikit-learn>=0.18 in /usr/local/lib/python3.11/dist-packages (from scikit-plot) (1.6.1)
Requirement already satisfied: joblib>=0.10 in /usr/local/lib/python3.11/dist-packages (from scikit-plot) (1.4.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (4.56.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (24.2)
Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (11.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (3.2.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.11/dist-packages (from matplotlib>=1.4.0->scikit-plot) (2.8.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.18->scikit-plot) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.7->matplotlib>=1.4.0->scikit-plot) (1.17.0)
Downloading scipy-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.1/34.1 MB 14.6 MB/s eta 0:00:00
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.13.1
    Uninstalling scipy-1.13.1:
      Successfully uninstalled scipy-1.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikit-image 0.25.1 requires scipy>=1.11.2, but you have scipy 1.10.0 which is incompatible.
imbalanced-learn 0.13.0 requires scipy<2,>=1.10.1, but you have scipy 1.10.0 which is incompatible.
Successfully installed scipy-1.10.0
In [ ]:
from numpy import interp
In [ ]:
!pip install scipy==1.11.4
Collecting scipy==1.11.4
  Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.4/60.4 kB 3.4 MB/s eta 0:00:00
Requirement already satisfied: numpy<1.28.0,>=1.21.6 in /usr/local/lib/python3.11/dist-packages (from scipy==1.11.4) (1.26.4)
Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.4/36.4 MB 38.4 MB/s eta 0:00:00
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.10.0
    Uninstalling scipy-1.10.0:
      Successfully uninstalled scipy-1.10.0
Successfully installed scipy-1.11.4
In [ ]:
gbm_tuned_model.fit(X_train_encoded, y_train)  # Train the model with your encoded training data
y_pred_prob = gbm_tuned_model.predict_proba(X_test_encoded)  # Now predict
In [ ]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify categorical features (assuming they are of 'object' type)
categorical_features = X_test.select_dtypes(include=['object', 'category']).columns.tolist()

# Create a ColumnTransformer to apply OneHotEncoder to categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', X_test.select_dtypes(exclude=['object', 'category']).columns.tolist()),
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)])

# Fit and transform the preprocessor on your training data (X_train)
# This ensures that the same encoding is applied to both training and test data
X_train_encoded = preprocessor.fit_transform(X_train)

# Transform the test data (X_test) using the fitted preprocessor
X_test_encoded = preprocessor.transform(X_test)

# Now you can use X_test_encoded with your model's predict_proba method
y_pred_prob = gbm_tuned_model.predict_proba(X_test_encoded)

# Continue with the rest of your code...
In [ ]:
from sklearn.metrics import RocCurveDisplay

# Assuming 'gbm_tuned_model', 'X_test', and 'y_test' are defined
RocCurveDisplay.from_estimator(gbm_tuned_model, X_test, y_test)

plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], "b--")
plt.xlim([-0.05, 1])
plt.ylim([0, 1.05])
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
In [ ]:
seed = 1
loss_func = "logloss"

# Test and Validation sizes
test_size = 0.2
val_size = 0.25

# Dependent Varibale Value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}

df_pipe = data.copy()
cat_columns = df_pipe.select_dtypes(include="object").columns.tolist()
df_pipe[cat_columns] = df_pipe[cat_columns].astype("category")
In [ ]:
X = df_pipe.drop(columns=["attrition_flag"])  # Replace 'data_pipe' with 'df_pipe'
y = df_pipe["attrition_flag"].map(target_mapper)  # Replace 'data_pipe' with 'df_pipe'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-19-7741e6d25a6a> in <cell line: 0>()
----> 1 X = df_pipe.drop(columns=["attrition_flag"])  # Replace 'data_pipe' with 'df_pipe'
      2 y = df_pipe["attrition_flag"].map(target_mapper)  # Replace 'data_pipe' with 'df_pipe'

/usr/local/lib/python3.11/dist-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   5579                 weight  1.0     0.8
   5580         """
-> 5581         return super().drop(
   5582             labels=labels,
   5583             axis=axis,

/usr/local/lib/python3.11/dist-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4786         for axis, labels in axes.items():
   4787             if labels is not None:
-> 4788                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4789 
   4790         if inplace:

/usr/local/lib/python3.11/dist-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors, only_slice)
   4828                 new_axis = axis.drop(labels, level=level, errors=errors)
   4829             else:
-> 4830                 new_axis = axis.drop(labels, errors=errors)
   4831             indexer = axis.get_indexer(new_axis)
   4832 

/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   7068         if mask.any():
   7069             if errors != "ignore":
-> 7070                 raise KeyError(f"{labels[mask].tolist()} not found in axis")
   7071             indexer = indexer[~mask]
   7072         return self.delete(indexer)

KeyError: "['attrition_flag'] not found in axis"
In [ ]:
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 17) (2026, 17) (2026, 17)
In [ ]:
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
attrition_flag
0   0.839
1   0.161
Name: proportion, dtype: float64
attrition_flag
0   0.839
1   0.161
Name: proportion, dtype: float64
attrition_flag
0   0.840
1   0.160
Name: proportion, dtype: float64
In [ ]:
under_sample = RandomUnderSampler(random_state=seed)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [ ]:
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
]

# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"

# One-hot encoding columns
columns_to_encode = [
    "gender",
    "education_level",
    "marital_status",
     "income_category",
    "card_category",
]

# Numerical Columns
num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",
    "avg_utilization_ratio",
]
columns_to_null_imp_unknown = ["education_level", "marital_status"]
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FillUnknown(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        for col in X_.columns:
            # Access dtype of the Series using .dtypes
            if X_[col].dtypes.name == 'category':
                # Check if 'Unknown' is already a category before adding it
                if 'Unknown' not in X_[col].cat.categories:
                    X_[col] = X_[col].cat.add_categories('Unknown')
                X_[col] = X_[col].fillna('Unknown')
        return X_
In [ ]:
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
     "marital_status",  # Remove marital_status from columns_to_drop
]
In [ ]:
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
    # "marital_status",  # Remove this line to keep 'marital_status' in the data
]

feature_name_standardizer = FeatureNamesStandardizer()

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
    feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)

# Missing value imputation
imputer = FillUnknown()

# To encode the categorical data
one_hot = OneHotEncoder(handle_unknown="ignore")

columns_to_encode = [
    "gender",
    "education_level",
    "marital_status",  # This column should be present for encoding
    "income_category",
    "card_category",
]
scaler = RobustScaler()


# creating a transformer for feature name standardization and dropping columns
cleanser = Pipeline(
    steps=[
        ("feature_name_standardizer", feature_name_standardizer),
        ("column_dropper", column_dropper),
        ("value_mask", value_masker),
        ("imputation", imputer),
    ]
)

# creating a transformer for data encoding

encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])
preprocessor = ColumnTransformer(
    transformers=[
        ("encoding", encode_transformer, columns_to_encode),
        ("scaling", num_scaler, num_columns),
    ],
    remainder="passthrough",
)

# Model

gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="auto",
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)
In [ ]:
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    # "customer_age",  # Remove 'customer_age' from columns_to_drop as it's needed in num_columns
    # "marital_status",  # Remove marital_status from columns_to_drop
]

# ... (Rest of your code remains the same) ...

num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",
    "avg_utilization_ratio",
    "customer_age",  # Include 'customer_age' explicitly in num_columns
]
In [ ]:
# ... (rest of the code) ...

# Initialize the OneHotEncoder
from sklearn.preprocessing import OneHotEncoder # Import OneHotEncoder if not already imported
one_hot = OneHotEncoder(handle_unknown="ignore")

# Creating a transformer for data encoding
encode_transformer = Pipeline(steps=[("onehot", one_hot)])

# ... (rest of the code) ...
In [ ]:
# ... (rest of the code) ...

# Instantiate the RobustScaler
from sklearn.preprocessing import RobustScaler  # Import RobustScaler if not already imported
scaler = RobustScaler()  # Create an instance of RobustScaler

# Define the columns_to_encode list
columns_to_encode = [
    "gender",
    "education_level",
    "marital_status",
    "income_category",
    "card_category",
]

# Create a transformer for data encoding
encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])  # Now 'scaler' is defined

preprocessor = ColumnTransformer(
    transformers=[
        ("encoding", encode_transformer, columns_to_encode),
        ("scaling", num_scaler, num_columns),
    ],
    remainder="passthrough",
)

# ... (rest of the code) ...
In [ ]:
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
    "marital_status",  # Remove this line to keep 'marital_status' in the data
]

# ... (rest of the code) ...

# Creating a transformer for data encoding

encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])
preprocessor = ColumnTransformer(
    transformers=[
        ("encoding", encode_transformer, columns_to_encode),
        ("scaling", num_scaler, num_columns),
    ],
    remainder="passthrough",
)

# ... (rest of the code) ...
In [ ]:
print(type(X_train_un))
<class 'pandas.core.frame.DataFrame'>
In [ ]:
from sklearn.base import BaseEstimator, TransformerMixin

class FillUnknown(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        for col in X_.columns:
            # Access dtype of the Series using .dtypes
            if X_[col].dtypes.name == 'category':
                X_[col] = X_[col].cat.add_categories('Unknown')
                X_[col] = X_[col].fillna('Unknown')
        return X_
In [ ]:
# ipython-input-80-1cce1b5a049f
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
     "marital_status",  # Remove this line to keep 'marital_status' in the data
]

# ... (rest of your code) ...
In [ ]:
feature_name_standardizer = FeatureNamesStandardizer()

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
    feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)

# Missing value imputation
imputer = FillUnknown()

# To encode the categorical data
one_hot = OneHotEncoder(handle_unknown="ignore")
scaler = RobustScaler()


# creating a transformer for feature name standardization and dropping columns
cleanser = Pipeline(
    steps=[
        ("feature_name_standardizer", feature_name_standardizer),
        ("column_dropper", column_dropper),
        ("value_mask", value_masker),
        ("imputation", imputer),
    ]
)

# creating a transformer for data encoding

encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])

preprocessor = ColumnTransformer(
    transformers=[
        ("encoding", encode_transformer, columns_to_encode),
        ("scaling", num_scaler, num_columns),
         ],
    remainder="passthrough",
)
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="auto",
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)

# Creating new pipeline with best parameters
model_pipe = Pipeline(
    steps=[
        ("cleanse", cleanser),
        ("preprocess", preprocessor),
        ("model", gbm_tuned_model),
    ]
)
# Fit the model on training data
model_pipe.fit(X_train_un, y_train_un)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-94-7a56560dd77f> in <cell line: 0>()
     57 )
     58 # Fit the model on training data
---> 59 model_pipe.fit(X_train_un, y_train_un)

NameError: name 'X_train_un' is not defined
In [ ]:
print("Is X_train_un defined?", 'X_train_un' in globals())
print("Is y_train_un defined?", 'y_train_un' in globals())
Is X_train_un defined? False
Is y_train_un defined? False
In [ ]:
print(vars().keys())  # Lists all defined variables
dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__builtin__', '__builtins__', '_ih', '_oh', '_dh', 'In', 'Out', 'get_ipython', 'exit', 'quit', '_', '__', '___', '_i', '_ii', '_iii', '_i1', 'ColumnTransformer', 'OneHotEncoder', 'StandardScaler', 'preprocessor', '_i2', 'train_test_split', '_i3', 'pd', 'data', '_i4', 'np', '_i5', 'plt', 'sns', '_i6', 'SimpleImputer', '_i7', 'LogisticRegression', 'DecisionTreeClassifier', 'AdaBoostClassifier', 'GradientBoostingClassifier', 'RandomForestClassifier', 'BaggingClassifier', '_i8', 'XGBClassifier', '_exit_code', 'lgb', '_i9', 'metrics', 'StratifiedKFold', 'cross_val_score', 'f1_score', 'accuracy_score', 'recall_score', 'precision_score', 'confusion_matrix', 'roc_auc_score', 'ConfusionMatrixDisplay', 'RocCurveDisplay', '_i10', 'MinMaxScaler', 'RobustScaler', '_i11', 'GridSearchCV', 'RandomizedSearchCV', 'Pipeline', 'TransformerMixin', 'SMOTE', 'RandomUnderSampler', '_i12', 'ProfileReport', '_i13', 'df', '_i14', 'additional_droppable_columns', 'col', '_i15', '_i16', 'columns_to_drop', '_i17', '_i18', 'seed', 'loss_func', 'test_size', 'val_size', 'target_mapper', 'df_pipe', 'cat_columns', '_i19', '_i20', 'category_unique_value', '_i21', '_i22', 'marital_status_col', '_i23', '_i24', '_i25', 'category_columns', '_i26', '_i27', '_i28', '_28', '_i29', 'summary', '_i30', '_i31', 'perc_on_bar', '_i32', 'box_by_target', '_i33', 'cat_view', '_i34', 'feature_name_standardize', 'drop_feature', 'mask_value', 'impute_category_unknown', '_i35', '_i36', 'column_to_mask_value', 'value_to_mask', 'masked_value', '_i37', '_i38', '_i39', 'X', '_i40', '_i41', 'y', '_i42', 'X_temp', 'X_test', 'y_temp', 'y_test', 'X_train', 'X_val', 'y_train', 'y_val', '_i43', '_i44', 'BaseEstimator', 'FeatureNamesStandardizer', '_i45', '_i46', '_i47', '_i48', 'ColumnDropper', '_i49', '_i50', '_i51', '_i52', 'feature_name_standardizer', 'column_dropper', '_i53', 'CustomValueMasker', '_i54', 'robust_scaler', 'num_columns', '_i55', '_i56', 'get_metrics_score', '_i57', '_i58', '_i59', '_i60', 'make_confusion_matrix', '_i61', 'model_names', 'acc_train', 'acc_test', 'recall_train', 'recall_test', 'precision_train', 'precision_test', 'f1_train', 'f1_test', 'roc_auc_train', 'roc_auc_test', 'cross_val_train', '_i62', 'add_score_model', '_i63', 'models', 'cv_results', '_i64', 'categorical_cols', '_i65', '_i66', '_i67', 'X_train_encoded', 'X_val_encoded', '_i68', 'LabelEncoder', 'label_encoders', 'le', '_i69', '_i70', '_i71', '_i72', '_i73', '_i74', '_i75', '_i76', 'one_hot', 'encode_transformer', '_i77', '_i78', 'scaler', 'num_scaler', '_i79', 'columns_to_encode', '_i80', '_i81', '_i82', '_i83', 'columns_to_scale', '_i84', 'value_masker', '_i85', '_i86', 'FillUnknown', '_i87', 'imputer', 'cleanser', 'gbm_tuned_model', 'model_pipe', '_i88', '_i89', '_i90', '_i91', '_i92', '_i93', '_i94', '_i95', '_i96', '_i97', '_i98', '_i99'])
In [ ]:
X_train_un, X_test_un, y_train_un, y_test_un = train_test_split(X, y, test_size=0.2, random_state=42)
In [ ]:
print(X_train_un.shape)  # Check if it contains data
(8101, 17)
In [ ]:
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
    feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)

# Missing value imputation
imputer = FillUnknown()

# To encode the categorical data
one_hot = OneHotEncoder(handle_unknown="ignore")

# To scale numerical columns
scaler = RobustScaler()
In [ ]:
cleanser = Pipeline(
    steps=[
        ("feature_name_standardizer", feature_name_standardizer),
        ("column_dropper", column_dropper),
        ("value_mask", value_masker),
        ("imputation", imputer),
    ]
)
In [ ]:
encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])

preprocessor = ColumnTransformer(
    transformers=[
        ("encoding", encode_transformer, columns_to_encode),
        ("scaling", num_scaler, num_columns),
    ],
    remainder="passthrough",
)
In [ ]:
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="auto",
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)

# Creating new pipeline with best parameters
model_pipe = Pipeline(
    steps=[
        ("cleanse", cleanser),
        ("preprocess", preprocessor),
        ("model", gbm_tuned_model),
    ]
)
In [ ]:
# Replace "Unknown" with NaN (so it gets handled by the imputer)
X_train_un = X_train_un.replace("Unknown", np.nan)
X_test_un = X_test_un.replace("Unknown", np.nan)
<ipython-input-108-3cf0dd79f76a>:2: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  X_train_un = X_train_un.replace("Unknown", np.nan)
<ipython-input-108-3cf0dd79f76a>:3: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.
  X_test_un = X_test_un.replace("Unknown", np.nan)
In [ ]:
one_hot = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
In [ ]:
print(X_train_un.select_dtypes(include="category").apply(lambda x: x.cat.categories))
gender                             Index(['F', 'M'], dtype='object')
education_level    Index(['College', 'Doctorate', 'Graduate', 'Hi...
income_category    Index(['$120K +', '$40K - $60K', '$60K - $80K'...
card_category      Index(['Blue', 'Gold', 'Platinum', 'Silver'], ...
dtype: object
In [ ]:
print("Columns in X_train_un:", X_train_un.columns)
Columns in X_train_un: Index(['customer_age', 'gender', 'dependent_count', 'education_level',
       'income_category', 'card_category', 'months_on_book',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal',
       'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt',
       'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
print("Columns sent for encoding:", columns_to_encode)
if "marital_status" not in columns_to_encode:
    print("⚠️ Warning: 'marital_status' is missing from encoding step!")
Columns sent for encoding: ['gender', 'education_level', 'marital_status', 'income_category', 'card_category']
In [ ]:
X_train_un_transformed = feature_name_standardizer.transform(X_train_un)
print("Transformed column names:", X_train_un_transformed.columns)
Transformed column names: Index(['customer_age', 'gender', 'dependent_count', 'education_level',
       'income_category', 'card_category', 'months_on_book',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal',
       'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt',
       'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
print("Columns in X_train_un:", X_train_un.columns)
Columns in X_train_un: Index(['customer_age', 'gender', 'dependent_count', 'education_level',
       'income_category', 'card_category', 'months_on_book',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal',
       'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt',
       'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
print("Columns before column dropper:", X_train_un.columns)
X_train_un = column_dropper.transform(X_train_un)
print("Columns after column dropper:", X_train_un.columns)
Columns before column dropper: Index(['customer_age', 'gender', 'dependent_count', 'education_level',
       'income_category', 'card_category', 'months_on_book',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal',
       'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt',
       'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
Columns after column dropper: Index(['gender', 'education_level', 'income_category', 'card_category',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1',
       'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
X_transformed = feature_name_standardizer.transform(X_train_un)
print("Columns after feature name standardization:", X_transformed.columns)
Columns after feature name standardization: Index(['gender', 'education_level', 'income_category', 'card_category',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1',
       'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1'],
      dtype='object')
In [ ]:
print("Columns sent for encoding:", columns_to_encode)
Columns sent for encoding: ['gender', 'education_level', 'marital_status', 'income_category', 'card_category']
In [ ]:
# Ensure 'marital_status' is not dropped and is correctly named
# 1. Check the `columns_to_drop` list and remove 'marital_status' if present.
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
     "Marital_Status",  # Remove if present and you intend to keep the column
]
# 2. Check for any renaming of 'marital_status' during data processing and adjust the pipeline.
# If you've used FeatureNamesStandardizer, it's likely renamed to 'marital_status'

# Check your ColumnTransformer definition for correct column names:
# Modify `num_columns` and `cat_columns` to reflect correct names:
num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",
]
cat_columns = [
    "gender",
    "education_level",
    "income_category",
    "card_category",
    "Marital_Status"  # Make sure this is included
]
In [ ]:
print("Columns in X_train_un:", X_train_un.columns.tolist())
Columns in X_train_un: ['gender', 'education_level', 'income_category', 'card_category', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1']
In [ ]:
X_transformed = feature_name_standardizer.transform(X_train_un)
print("Columns after feature name standardization:", X_transformed.columns.tolist())
Columns after feature name standardization: ['gender', 'education_level', 'income_category', 'card_category', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1']
In [ ]:
print("Columns to encode:", columns_to_encode)
Columns to encode: ['gender', 'education_level', 'marital_status', 'income_category', 'card_category']
In [ ]:
print("Original dataset columns:", df.columns.tolist())  # Your original DataFrame
print("Columns in X_train_un:", X_train_un.columns.tolist())
Original dataset columns: ['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
Columns in X_train_un: ['gender', 'education_level', 'income_category', 'card_category', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'total_revolving_bal', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1']
In [ ]:
print("Original dataset columns:", df.columns.tolist())
Original dataset columns: ['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
In [ ]:
print("Columns in data:", data.columns.tolist())
Columns in data: ['attrition_flag', 'customer_age', 'gender', 'dependent_count', 'education_level', 'income_category', 'card_category', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal', 'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1']
In [ ]:
print([col for col in data.columns if "marital" in col.lower()])
[]
In [ ]:
print(data.columns.tolist())
['attrition_flag', 'customer_age', 'gender', 'dependent_count', 'education_level', 'income_category', 'card_category', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal', 'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1']
In [ ]:
data.columns = data.columns.str.strip().str.replace(" ", "_").str.lower()
print(data.columns.tolist())  # Print cleaned column names
['attrition_flag', 'customer_age', 'gender', 'dependent_count', 'education_level', 'income_category', 'card_category', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal', 'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_trans_ct', 'total_ct_chng_q4_q1']
In [ ]:
print(data.head())
      attrition_flag  customer_age gender  dependent_count education_level  \
0  Existing Customer            45      M                3     High School   
1  Existing Customer            49      F                5        Graduate   
2  Existing Customer            51      M                3        Graduate   
3  Existing Customer            40      F                4     High School   
4  Existing Customer            40      M                3      Uneducated   

  income_category card_category  months_on_book  total_relationship_count  \
0     $60K - $80K          Blue              39                         5   
1  Less than $40K          Blue              44                         6   
2    $80K - $120K          Blue              36                         4   
3  Less than $40K          Blue              34                         3   
4     $60K - $80K          Blue              21                         5   

   months_inactive_12_mon  contacts_count_12_mon  credit_limit  \
0                       1                      3     12691.000   
1                       1                      2      8256.000   
2                       1                      0      3418.000   
3                       4                      1      3313.000   
4                       1                      0      4716.000   

   total_revolving_bal  avg_open_to_buy  total_amt_chng_q4_q1  \
0                  777        11914.000                 1.335   
1                  864         7392.000                 1.541   
2                    0         3418.000                 2.594   
3                 2517          796.000                 1.405   
4                    0         4716.000                 2.175   

   total_trans_amt  total_trans_ct  total_ct_chng_q4_q1  
0             1144              42                1.625  
1             1291              33                3.714  
2             1887              20                2.333  
3             1171              20                2.333  
4              816              28                2.500  
In [ ]:
# Ensure 'marital_status' is NOT in 'columns_to_drop'
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",

]
In [ ]:
model_pipe.fit(X_train_un, y_train_un)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py in get_loc(self, key)
   3804         try:
-> 3805             return self._engine.get_loc(casted_key)
   3806         except KeyError as err:

index.pyx in pandas._libs.index.IndexEngine.get_loc()

index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'marital_status'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/sklearn/utils/_indexing.py in _get_column_indices(X, key)
    363             for col in columns:
--> 364                 col_idx = all_columns.get_loc(col)
    365                 if not isinstance(col_idx, numbers.Integral):

/usr/local/lib/python3.11/dist-packages/pandas/core/indexes/base.py in get_loc(self, key)
   3811                 raise InvalidIndexError(key)
-> 3812             raise KeyError(key) from err
   3813         except TypeError:

KeyError: 'marital_status'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-145-5f98bb947167> in <cell line: 0>()
----> 1 model_pipe.fit(X_train_un, y_train_un)

/usr/local/lib/python3.11/dist-packages/sklearn/base.py in wrapper(estimator, *args, **kwargs)
   1387                 )
   1388             ):
-> 1389                 return fit_method(estimator, *args, **kwargs)
   1390 
   1391         return wrapper

/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py in fit(self, X, y, **params)
    652 
    653         routed_params = self._check_method_params(method="fit", props=params)
--> 654         Xt = self._fit(X, y, routed_params, raw_params=params)
    655         with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    656             if self._final_estimator != "passthrough":

/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py in _fit(self, X, y, routed_params, raw_params)
    586             )
    587 
--> 588             X, fitted_transformer = fit_transform_one_cached(
    589                 cloned_transformer,
    590                 X,

/usr/local/lib/python3.11/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    310 
    311     def __call__(self, *args, **kwargs):
--> 312         return self.func(*args, **kwargs)
    313 
    314     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, params)
   1549     with _print_elapsed_time(message_clsname, message):
   1550         if hasattr(transformer, "fit_transform"):
-> 1551             res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
   1552         else:
   1553             res = transformer.fit(X, y, **params.get("fit", {})).transform(

/usr/local/lib/python3.11/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    317     @wraps(f)
    318     def wrapped(self, X, *args, **kwargs):
--> 319         data_to_wrap = f(self, X, *args, **kwargs)
    320         if isinstance(data_to_wrap, tuple):
    321             # only wrap the first output for cross decomposition

/usr/local/lib/python3.11/dist-packages/sklearn/base.py in wrapper(estimator, *args, **kwargs)
   1387                 )
   1388             ):
-> 1389                 return fit_method(estimator, *args, **kwargs)
   1390 
   1391         return wrapper

/usr/local/lib/python3.11/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y, **params)
    991         n_samples = _num_samples(X)
    992 
--> 993         self._validate_column_callables(X)
    994         self._validate_remainder(X)
    995 

/usr/local/lib/python3.11/dist-packages/sklearn/compose/_column_transformer.py in _validate_column_callables(self, X)
    550                 columns = columns(X)
    551             all_columns.append(columns)
--> 552             transformer_to_input_indices[name] = _get_column_indices(X, columns)
    553 
    554         self._columns = all_columns

/usr/local/lib/python3.11/dist-packages/sklearn/utils/_indexing.py in _get_column_indices(X, key)
    370 
    371         except KeyError as e:
--> 372             raise ValueError("A given column is not a column of the dataframe") from e
    373 
    374         return column_indices

ValueError: A given column is not a column of the dataframe
In [ ]:
data["marital_status"] = "Unknown"
In [ ]:
X_train_un["marital_status"] = data["marital_status"]
In [ ]:
print(columns_to_encode)
['gender', 'education_level', 'marital_status', 'income_category', 'card_category']
In [ ]:
columns_to_encode.append("marital_status")
In [ ]:
X_train_un["marital_status"] = X_train_un["marital_status"].astype(str).fillna("Unknown")
In [ ]:
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="sqrt",  # 🔥 Fixed here
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=42,
)
In [ ]:
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="log2",
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=42,
)
In [ ]:
# Use half the features
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features=0.5,
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=42,
)

# Or use a fixed number of features
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features=5,
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=42,
)
In [ ]:
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features=None,
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=42,
)
In [ ]:
from sklearn.pipeline import Pipeline  # Import Pipeline
from sklearn.ensemble import GradientBoostingClassifier  # Import GradientBoostingClassifier

# ... (other pipeline steps) ...

model_pipe = Pipeline(
    steps=[
        # ... (other pipeline steps) ...,
        (
            "gbm",
            GradientBoostingClassifier(
                n_estimators=700,
                max_features=None,  # Change to a valid value: None, int, float, 'sqrt', 'log2'
                max_depth=25,
                min_samples_split=2,
                min_samples_leaf=15,
                random_state=42,
            ),
        ),
        # ... (other pipeline steps) ...,
    ]
)

# ... (rest of your code) ...
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier

# ... (other imports) ...

# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Create transformers for numerical and categorical features
numerical_transformer = Pipeline(steps=[])  # No transformation for numerical features in this case

categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))  # One-hot encode categorical features
    ]
)

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Create the final pipeline with the preprocessor and the model
model_pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),  # Apply preprocessing
        (
            "gbm",
            GradientBoostingClassifier(
                n_estimators=700,
                max_features=None,
                max_depth=25,
                min_samples_split=2,
                min_samples_leaf=15,
                random_state=42,
            ),
        ),
    ]
)

# ... (rest of your code) ...
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler # Import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# ... (other imports) ...

# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Create transformers for numerical and categorical features
# Add StandardScaler to numerical_transformer
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))  # One-hot encode categorical features
    ]
)

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Create the final pipeline with the preprocessor and the model
model_pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),  # Apply preprocessing
        (
            "gbm",
            GradientBoostingClassifier(
                n_estimators=700,
                max_features=None,
                max_depth=25,
                min_samples_split=2,
                min_samples_leaf=15,
                random_state=42,
            ),
        ),
    ]
)

# ... (rest of your code) ...
In [ ]:
model_pipe.fit(X_train_un, y_train_un)
Out[ ]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['total_relationship_count',
                                                   'months_inactive_12_mon',
                                                   'contacts_count_12_mon',
                                                   'total_revolving_bal',
                                                   'total_amt_chng_q4_q1',
                                                   'total_trans_amt',
                                                   'total_trans_ct',
                                                   'total_ct_chng_q4_q1']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['gender', 'education_level',
                                                   'income_category',
                                                   'card_category',
                                                   'marital_status'])])),
                ('gbm',
                 GradientBoostingClassifier(max_depth=25, min_samples_leaf=15,
                                            n_estimators=700,
                                            random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

# ... (other imports) ...

# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Create transformers for numerical and categorical features
# Add StandardScaler to numerical_transformer
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy='most_frequent')), # Impute missing values before OneHotEncoding
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))  # One-hot encode categorical features
    ]
)

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

# ... (other imports) ...

# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Create transformers for numerical and categorical features
# Add StandardScaler to numerical_transformer
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy='most_frequent')), # Impute missing values before OneHotEncoding
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))  # One-hot encode categorical features
    ]
)

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ])
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

# ... (other imports) ...

# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Create transformers for numerical and categorical features
# Add StandardScaler to numerical_transformer
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy='most_frequent')), # Impute missing values before OneHotEncoding
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))  # One-hot encode categorical features
    ]
)

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

# ... (other imports) ...

# Assuming categorical_features is a list of your categorical column names
categorical_features = X_train_un.select_dtypes(include=['object', 'category']).columns.tolist()
# Define numerical_features by selecting columns that are not categorical
numerical_features = X_train_un.select_dtypes(exclude=['object', 'category']).columns.tolist()
# ... (rest of the code) ...
In [ ]:
print(X_test.shape)
(2026, 13)
In [ ]:
if X_test.shape[0] == 0:
    print("Error: X_test is empty!")
    # Try using a subset of the training set as a fallback
    X_test = X_train_un[:10]  # Use first 10 rows as a workaround
    y_test = y_train_un[:10]
In [ ]:
print(f"X_train_un shape: {X_train_un.shape}")
print(f"y_train_un shape: {y_train_un.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
X_train_un shape: (8101, 13)
y_train_un shape: (8101,)
X_test shape: (2026, 13)
y_test shape: (2026,)
In [ ]:
print(set(X_train_un.columns) - set(X_test.columns))  # Columns in train but missing in test
print(set(X_test.columns) - set(X_train_un.columns))  # Columns in test but missing in train
set()
set()
In [ ]:
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")
X_test shape: (2026, 13)
y_test shape: (2026,)
In [ ]:
correct_column_name = "attrition_flag"  # Replace with the actual column name you want to drop
X_train_un = data.drop(columns=[correct_column_name])
In [ ]:
target_col = [col for col in data.columns if "target" in col.lower()]
print(target_col)  # Check which column(s) match

if target_col:
    X_train_un = data.drop(columns=target_col[0])  # Use the found column
else:
    print("⚠️ No column found with 'target' in the name!")
[]
⚠️ No column found with 'target' in the name!
In [ ]:
print(type(data))  # Should be <class 'pandas.DataFrame'>
print(data.shape)  # Check number of rows/columns
<class 'pandas.core.frame.DataFrame'>
(10127, 19)
In [ ]:
if "target_column" in data.columns:
    X_train_un = data.drop(columns=["target_column"])
else:
    print("⚠️ 'target_column' not found! Available columns:", data.columns)
⚠️ 'target_column' not found! Available columns: Index(['attrition_flag', 'customer_age', 'gender', 'dependent_count',
       'education_level', 'income_category', 'card_category', 'months_on_book',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal',
       'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt',
       'total_trans_ct', 'total_ct_chng_q4_q1', 'marital_status'],
      dtype='object')
In [ ]:
data.rename(columns={"wrong_column_name": "target_column"}, inplace=True)
In [ ]:
print(data.columns)  # Print the column names
Index(['attrition_flag', 'customer_age', 'gender', 'dependent_count',
       'education_level', 'income_category', 'card_category', 'months_on_book',
       'total_relationship_count', 'months_inactive_12_mon',
       'contacts_count_12_mon', 'credit_limit', 'total_revolving_bal',
       'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt',
       'total_trans_ct', 'total_ct_chng_q4_q1', 'marital_status'],
      dtype='object')
In [ ]:
X_train_un = data.drop(columns=["attrition_flag"])
In [ ]:
from sklearn.preprocessing import OneHotEncoder
# ... other imports ...

# ... your pipeline definition ...

categorical_features = ['gender', 'education_level', 'marital_status', 'income_category', 'card_category']

categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore')) # handle_unknown='ignore' added
    ]
)
# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", robust_scaler, num_columns),
        ("cat", categorical_transformer, categorical_features),
    ]
)
# Define the pipeline
model_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
# ... rest of your code ...
In [ ]:
categorical_features = ['gender', 'education_level', 'income_category', 'card_category']

# Find the actual column name if it exists with different casing
for col in X_train.columns:
    if col.lower().strip() == 'marital_status':
        categorical_features.insert(2, col)  # Insert at the correct position
        break  # Exit the loop once found

# If the column is still not found, print a warning and proceed without it
if 'marital_status' not in [c.lower().strip() for c in categorical_features]:
    print("Warning: 'marital_status' column not found. Proceeding without it.")

categorical_transformer = Pipeline(
    steps=[
        ("onehot", OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", robust_scaler, num_columns),
        ("cat", categorical_transformer, categorical_features),
    ]
)

model_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
Warning: 'marital_status' column not found. Proceeding without it.
In [ ]:
# Fit the pipeline to the training data
model_pipe.fit(X_train, y_train)

# Now you can score on the test data
print(
    "Accuracy on Test is: {}%".format(round(model_pipe.score(X_test, y_test) * 100, 0))
)
Accuracy on Test is: 89.0%
/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
pred_train_p = model_pipe.predict_proba(X_train)[:, 1] > 0.5  # Use X_train instead of X_train_un
pred_test_p = model_pipe.predict_proba(X_test)[:, 1] > 0.5

pred_train_p = np.round(pred_train_p)
pred_test_p = np.round(pred_test_p)

train_acc_p = accuracy_score(y_train, pred_train_p)  # Use y_train instead of y_train_un
test_acc_p = accuracy_score(y_test, pred_test_p)

train_recall_p = recall_score(y_train, pred_train_p)  # Use y_train instead of y_train_un
test_recall_p = recall_score(y_test, pred_test_p)
In [ ]:
print("Recall on Test is: {}%".format(round(test_recall_p * 100, 0)))
Recall on Test is: 74.0%
In [ ]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd  # Make sure pandas is imported

# ... (Your previous code) ...

# Assuming 'data' is your original DataFrame, create 'data_clean'
# Replace this with the appropriate operations to clean your data
data_clean = data.copy()  # Example: Create a copy of 'data'

# ... (Rest of your code to generate the heatmap) ...

# Select only numerical features for correlation calculation
numerical_data = data_clean.select_dtypes(include=np.number)

mask = np.zeros_like(numerical_data.corr(), dtype=bool) # Use numerical_data.corr()
mask[np.triu_indices_from(mask)] = True

sns.set(rc={"figure.figsize": (15, 15)})

sns.heatmap(
    numerical_data.corr(),  # Use numerical_data.corr()
    cmap=sns.diverging_palette(20, 220, n=200),
    annot=True,
    mask=mask,
    center=0,
)
plt.show()
In [ ]:
mask = np.zeros_like(numerical_data.corr(), dtype=bool)  # Change np.bool to bool
In [ ]:
# Assuming X_test and X_train have the same columns but some are categorical

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify categorical features (e.g., 'gender', 'education_level')
# 'marital_status' has been removed from the list as it is causing the error
categorical_features = ['gender', 'education_level', 'income_category', 'card_category']

# Create a ColumnTransformer to apply OneHotEncoder to categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', [col for col in X_train.columns if col not in categorical_features]), # Passthrough for numerical
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features),  # One-hot encode categorical
    ])

# Create a pipeline with the preprocessor and your model
model_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42)),  # Replace with your desired model
])

# Fit the pipeline to your training data
model_pipe.fit(X_train, y_train)

# Now you can make predictions on your test data
y_pred_gb = model_pipe.predict_proba(X_test)[:, 1]

# ... (rest of your code for other models and y_pred_all calculation) ...
In [ ]:
from sklearn.metrics import average_precision_score, roc_auc_score

# Assuming 'model_pipe' is your trained model and 'X_test' is your test data
y_pred_all = model_pipe.predict_proba(X_test)[:, 1]  # Get predicted probabilities for class 1

# Now you can use the function
average_precision_score(y_test, y_pred_all), roc_auc_score(y_test, y_pred_all)
Out[ ]:
(0.7411274574943183, 0.9294695428028761)

Insights and Recommendations:

Insights:

1.The proportion of attrited customers by gender there are 14.4% more male than female who have changed.

  1. The ratio of the customers are equally distributed. 3.Customers related to Graduate level is first and next is post graduate level. 4.Maried stands first then single and other categories of customers.
  2. Incomelevel is first targeted on 60-till 80 and then less then 40. 6.Customers with high income will not leave the credit card

Recommendation: 1.There should be a good coordinatoon and relationship betweek Bank and customer.

  1. Customer should be informed about and changes and new offers
  2. Cashback is very good option that can be given to customer for better usage of credit cards 4.Rewards for customers with regular payment history would be added advantage. 5.Convertion to EMI option with 0% interest would be added advantage. 6.Additional cards offer would help the customer for their personal usage. 7.By using our model maximum amount of customers can be reached for different advantages. 8.By giving the advantaged attrition rate can be reduced